# Week 4 Exercise (group): Exploratory Data Analysis on Social Media Data

- Cisco Yang
- Zhukang Qin
- Ashley Wang
- Wenbo Wei

## 1. Import necessary packages

In [39]:
# install third-party packages, then import everything used below
!pip install nltk
!pip install emoji
import re

import emoji
import nltk
import pandas as pd



## 2. Read the data

The data file is `tweets.csv`, located in the same folder. For more information about the data, see [here](https://www.kaggle.com/datasets/infamouscoder/mental-health-social-media).

The main column you will be working with is `post_text`

In [24]:
# explore the data characteristics using `df.describe()` or `df.info()`
csv_file = "tweets.csv"
df = pd.read_csv(csv_file)
df.info()  # column names, dtypes, and non-null counts
df1 = df["post_text"]  # the main text column
df1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    20000 non-null  int64 
 1   post_id       20000 non-null  int64 
 2   post_created  20000 non-null  object
 3   post_text     20000 non-null  object
 4   user_id       20000 non-null  int64 
 5   followers     20000 non-null  int64 
 6   friends       20000 non-null  int64 
 7   favourites    20000 non-null  int64 
 8   statuses      20000 non-null  int64 
 9   retweets      20000 non-null  int64 
 10  label         20000 non-null  int64 
dtypes: int64(9), object(2)
memory usage: 1.7+ MB


0        It's just over 2 years since I was diagnosed w...
1        It's Sunday, I need a break, so I'm planning t...
2        Awake but tired. I need to sleep but my brain ...
3        RT @SewHQ: #Retro bears make perfect gifts and...
4        It’s hard to say whether packing lists are mak...
                               ...                        
19995                A day without sunshine is like night.
19996    Boren's Laws: (1) When in charge, ponder. (2) ...
19997    The flow chart is a most thoroughly oversold p...
19998    Ships are safe in harbor, but they were never ...
19999       Black holes are where God is dividing by zero.
Name: post_text, Length: 20000, dtype: object
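
The `Unnamed: 0` column in the `df.info()` output looks like a saved row index. A minimal sketch of reading such a file with that column as the index and checking the class balance of `label` is shown here. The inline CSV and its rows are hypothetical stand-ins for `tweets.csv`, and only a subset of the real columns is used.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for tweets.csv (not real data).
sample_csv = """,post_id,post_text,label
0,101,"first post",1
1,102,"second post",0
2,103,"third post",1
"""

# index_col=0 uses the unnamed first column as the index
# instead of keeping it as a regular "Unnamed: 0" column.
sample_df = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# Count how many rows fall under each label value.
label_counts = {int(k): int(v) for k, v in sample_df["label"].value_counts().items()}
print(list(sample_df.columns))
print(label_counts)
```

With the real file, the same idea would be `pd.read_csv("tweets.csv", index_col=0)`, assuming the first column really is a saved index.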

## 3. Extract emojis

Use `emoji` package to extract emojis and put them into a new column called `emojis`

In [40]:
# define the function
def extract_emojis(text):
    # match runs of consecutive characters from four common emoji blocks
    emoji_pattern = re.compile("["
                               "\U0001F600-\U0001F64F"  # emoticons
                               "\U0001F300-\U0001F5FF"  # symbols and pictographs
                               "\U0001F680-\U0001F6FF"  # transport and map symbols
                               "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.findall(text)

# apply the function to your dataframe
df['emojis'] = df['post_text'].apply(extract_emojis)
df['emojis']

0        []
1        []
2        []
3        []
4        []
         ..
19995    []
19996    []
19997    []
19998    []
19999    []
Name: emojis, Length: 20000, dtype: object
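
The cell above uses a regular expression that covers four Unicode blocks, while the instructions ask for the `emoji` package. The sketch below shows one way to do that, assuming a version of `emoji` that provides `emoji.emoji_list` (available in the 2.x releases). If the package or that function is missing, it falls back to a regex whose ranges are an assumption and not exhaustive. The name `extract_emojis_pkg` is hypothetical.

```python
import re

# Fallback pattern: a handful of common emoji blocks (assumed, not exhaustive).
_FALLBACK_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)

try:
    import emoji
    # emoji_list returns one dict per emoji found, with an "emoji" key.
    _emoji_list = getattr(emoji, "emoji_list", None)
except ImportError:
    _emoji_list = None


def extract_emojis_pkg(text):
    """Return the emojis in text as a list, one entry per emoji."""
    if _emoji_list is not None:
        return [match["emoji"] for match in _emoji_list(text)]
    return _FALLBACK_PATTERN.findall(text)


print(extract_emojis_pkg("I love 🐍 and 🚀"))
```

To see how many posts contain at least one emoji, something like `df['emojis'].str.len().gt(0).sum()` could be run on the new column.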

## 4. Text Cleaning using Regular Expressions 

1. Remove URLs
2. Remove mentions
3. Remove hashtags
4. Remove special characters
5. Remove extra space

Code can be found in [week 6 lecture 1](https://github.com/yibeichan/COMM160DS/blob/main/week_6/lecture_part1.ipynb) section `4.4 All-in-One`

Perform the analysis and save the results into a new column.

In [50]:
# define the function
def clean_tweets(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text, flags=re.MULTILINE)  # remove URLs (dot escaped so only a literal "www." matches)
    text = re.sub(r'#\w+', '', text)  # remove hashtags
    text = re.sub(r'@\w+', '', text)  # remove mentions
    text = re.sub(r'\W', ' ', text)  # remove special characters
    text = re.sub(r'\s+', ' ', text)  # remove extra spaces
    return text.strip()

# apply the function to your dataframe
df['cleaned_tweets'] = df['post_text'].apply(clean_tweets)
df['cleaned_tweets'].head()

0    It s just over 2 years since I was diagnosed w...
1    It s Sunday I need a break so I m planning to ...
2    Awake but tired I need to sleep but my brain h...
3    RT bears make perfect gifts and are great for ...
4    It s hard to say whether packing lists are mak...
Name: cleaned_tweets, dtype: object
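
In the output above, contractions such as "It's" come out as "It s" because `\W` also replaces apostrophes, and the retweet marker "RT" survives. The following is a minimal sketch of a variant that keeps apostrophes and drops a leading "RT". The function name `clean_tweet_keep_apostrophes` is hypothetical, and whether apostrophes should be kept depends on the analysis that follows.

```python
import re


def clean_tweet_keep_apostrophes(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"[@#]\w+", "", text)                # remove mentions and hashtags
    text = re.sub(r"^RT\b", "", text)                  # remove a leading retweet marker
    text = text.replace("\u2019", "'")                 # normalize curly apostrophes
    text = re.sub(r"[^\w\s']", " ", text)              # remove other special characters
    text = re.sub(r"\s+", " ", text)                   # collapse extra spaces
    return text.strip()


print(clean_tweet_keep_apostrophes("It\u2019s hard, I'm tired."))
print(clean_tweet_keep_apostrophes("RT @SewHQ: #Retro bears make perfect gifts https://t.co/abc"))
```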

## 5. Analysis 1 (Sentiment Analysis)

Choose one analysis from (1) Sentiment Analysis, (2) N-grams and Phrase Analysis, (3) Collocation Analysis, (4) Part-of-Speech Tagging, (5) Named Entity Recognition, and (6) Dependency Parsing.

Perform the analysis and save the results into a new column.

In [56]:
# write your code here
!pip install textblob
from textblob import TextBlob

def get_sentiment(text):
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive)
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'

df["get_sentiment"] = df['post_text'].apply(get_sentiment)


In [57]:
df["get_sentiment"]

0        positive
1        negative
2        negative
3        positive
4        negative
           ...   
19995     neutral
19996    negative
19997    positive
19998    positive
19999    negative
Name: get_sentiment, Length: 20000, dtype: object
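
The labeling rule above treats a post as neutral only when the polarity is exactly 0, so a score of 0.01 counts as positive. A minimal sketch of a rule with a small neutral band around zero follows. The function name `label_polarity` and the default threshold of 0.05 are assumptions for illustration, not values from the exercise.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label, with a neutral band."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


for score in (0.5, 0.01, 0.0, -0.3):
    print(score, label_polarity(score))
```

It could be combined with TextBlob by passing `TextBlob(text).sentiment.polarity` as the score.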

## 6. Analysis 2 (N-grams and Phrase Analysis)

Choose another analysis from (1) Sentiment Analysis, (2) N-grams and Phrase Analysis, (3) Collocation Analysis, (4) Part-of-Speech Tagging, (5) Named Entity Recognition, and (6) Dependency Parsing.

Perform the analysis and save the results into a new column.

In [58]:
# write your code here
from nltk import ngrams  # nltk was installed in section 1

def generate_ngrams(text, n):
    tokens = text.split()  # whitespace tokenization; punctuation stays attached
    return list(ngrams(tokens, n))

# n=5 produces 5-grams, so the column is named accordingly
df["five_grams"] = df['post_text'].apply(generate_ngrams, n=5)



In [59]:
df["five_grams"]

0        [(It's, just, over, 2, years), (just, over, 2,...
1        [(It's, Sunday,, I, need, a), (Sunday,, I, nee...
2        [(Awake, but, tired., I, need), (but, tired., ...
3        [(RT, @SewHQ:, #Retro, bears, make), (@SewHQ:,...
4        [(It’s, hard, to, say, whether), (hard, to, sa...
                               ...                        
19995    [(A, day, without, sunshine, is), (day, withou...
19996    [(Boren's, Laws:, (1), When, in), (Laws:, (1),...
19997    [(The, flow, chart, is, a), (flow, chart, is, ...
19998    [(Ships, are, safe, in, harbor,), (are, safe, ...
19999    [(Black, holes, are, where, God), (holes, are,...
Name: five_grams, Length: 20000, dtype: object
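
A per-row list of n-grams is hard to read on its own, and for phrase analysis the usual next step is to count which n-grams recur across posts. The sketch below does this with the standard library only. The helper names `word_ngrams` and `top_ngrams` are hypothetical, the two example sentences are made up, and lowercasing the tokens is an assumption.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of text as a list of tuples."""
    tokens = text.lower().split()
    # zip over n shifted copies of the token list
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(texts, n=2, k=3):
    """Count n-grams over all texts and return the k most common."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(text, n))
    return counts.most_common(k)


examples = ["I need sleep", "I need a break"]
print(top_ngrams(examples, n=2, k=1))
```

Running it on cleaned text instead of raw posts would keep mentions, hashtags, and trailing punctuation out of the counts.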

## 7. Push Your Results to GitHub

As you did in previous weeks:
1. `git status`
2. `git add`
3. `git commit -m "type your message here"`
4. `git push`

If you can't push it to GitHub, it's okay to upload it manually.