# Week 6 Exercise (group): Exploratory Data Analysis on Social Media Data

- Caren Chua 
- Leslie Cohrt
- Sarah Auther
- Shoshana Medved

## 1. Import necessary packages

In [6]:
!pip install textblob

Collecting textblob
  Using cached textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Collecting nltk>=3.1
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Using cached regex-2023.5.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (769 kB)
Collecting click
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Installing collected packages: regex, click, nltk, textblob
Successfully installed click-8.1.3 nltk-3.8.1 regex-2023.5.5 textblob-0.17.1


In [5]:
!pip install emoji 

Collecting emoji
  Using cached emoji-2.2.0-py3-none-any.whl
Installing collected packages: emoji
Successfully installed emoji-2.2.0


In [7]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [8]:
import emoji

In [9]:
from textblob import TextBlob

In [10]:
import re

In [11]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## 2. Read the data

The data is in `tweets.csv`, located in the same folder as this notebook. For more information about the dataset, see [here](https://www.kaggle.com/datasets/infamouscoder/mental-health-social-media).

The main column you will be working with is `post_text`.

In [12]:
# df = 

df = pd.read_csv("tweets.csv")

# explore the data characteristic using `df.describe()` or `df.info()`
df.describe()

| | Unnamed: 0 | post_id | user_id | followers | friends | favourites | statuses | retweets | label |
|---|---|---|---|---|---|---|---|---|---|
| count | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 |
| mean | 9999.5 | 6.874728e+17 | 3.548623e+16 | 900.48395 | 782.42875 | 6398.23555 | 44394.42 | 1437.9273 | 0.5 |
| std | 5773.647028 | 1.708396e+17 | 1.606083e+17 | 1899.913961 | 1834.817945 | 8393.072914 | 140778.5 | 15119.665118 | 0.500013 |
| min | 0.0 | 3555966000.0 | 14724380.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 4999.75 | 5.931686e+17 | 324294400.0 | 177.0 | 211.0 | 243.0 | 5129.0 | 0.0 | 0.0 |
| 50% | 9999.5 | 7.6374e+17 | 1052122000.0 | 476.0 | 561.0 | 2752.0 | 13251.0 | 0.0 | 0.5 |
| 75% | 14999.25 | 8.153124e+17 | 2285923000.0 | 1197.0 | 701.0 | 8229.0 | 52892.0 | 1.0 | 1.0 |
| max | 19999.0 | 8.194574e+17 | 7.631825e+17 | 28614.0 | 28514.0 | 39008.0 | 1063601.0 | 839540.0 | 1.0 |


## 3. Extract emojis

Use the `emoji` package to extract emojis and put them into a new column called `emojis`.

In [33]:
# define the function
def extract_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.findall(text)

In [35]:
# apply the function to your dataframe
df['emojis'] = df['post_text'].apply(extract_emojis)
df['emojis']

0        []
1        []
2        []
3        []
4        []
         ..
19995    []
19996    []
19997    []
19998    []
19999    []
Name: emojis, Length: 20000, dtype: object
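
The function above matches a few hand-written Unicode ranges, so emojis outside those blocks are missed. Below is a minimal sketch of the package-based approach this section asks for, assuming `emoji` 2.x, where `emoji.emoji_list` returns one dict per match with an `'emoji'` key. The name `extract_emojis_pkg` and the regex fallback (used only when the package is not installed) are illustrative additions, not part of the original notebook.

```python
import re

try:
    import emoji  # third-party; installed earlier with `pip install emoji`
except ImportError:  # keep the sketch runnable without the package
    emoji = None

# Fallback pattern (an assumption of this sketch): a few common emoji blocks only.
_FALLBACK = re.compile("[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF\u2600-\u27BF]")


def extract_emojis_pkg(text):
    """Return the emojis found in `text`, in order of appearance."""
    if not isinstance(text, str):  # e.g. NaN for a missing post
        return []
    if emoji is not None:
        # emoji_list yields dicts with 'match_start', 'match_end', 'emoji'
        return [match["emoji"] for match in emoji.emoji_list(text)]
    return _FALLBACK.findall(text)


print(extract_emojis_pkg("Feeling great 😀 today 🚀"))
```

In the notebook this could be applied the same way as before, for example `df['emojis'] = df['post_text'].apply(extract_emojis_pkg)`.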

## 4. Text Cleaning using Regular Expressions 

1. Remove URLs
2. Remove mentions
3. Remove hashtags
4. Remove special characters
5. Remove extra spaces

Code can be found in [week 6 lecture 1](https://github.com/yibeichan/COMM160DS/blob/main/week_6/lecture_part1.ipynb) section `4.4 All-in-One`

Perform the analysis and save the results into a new column.
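
As a quick illustration of the five steps listed above, here is a minimal sketch that applies them one at a time to a made-up tweet and prints the intermediate text. The sample string, the name `clean_step_by_step`, and the exact patterns are assumptions for illustration; the lecture's `4.4 All-in-One` code may differ in detail.

```python
import re

# Each step is (description, pattern, replacement). Order matters: if special
# characters were stripped first, the "@" and "#" markers would be gone and the
# mention and hashtag patterns could no longer find their targets.
CLEANING_STEPS = [
    ("remove URLs", r"https?://\S+|www\.\S+", ""),
    ("remove mentions", r"@\w+", ""),
    ("remove hashtags", r"#\w+", ""),
    ("remove special characters", r"[^\w\s]", " "),
    ("collapse extra spaces", r"\s+", " "),
]


def clean_step_by_step(text, verbose=False):
    """Apply every cleaning step in order and return the stripped result."""
    for description, pattern, replacement in CLEANING_STEPS:
        text = re.sub(pattern, replacement, text)
        if verbose:
            print(f"{description:<26} -> {text!r}")
    return text.strip()


sample = "RT @user: Loving this #sunny day!! https://example.com"
print(clean_step_by_step(sample, verbose=True))
```

Printing each stage makes it easy to spot a step that removes too much or too little before running it on all 20,000 posts.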

In [32]:
df["post_text"]

0        It's just over 2 years since I was diagnosed w...
1        It's Sunday, I need a break, so I'm planning t...
2        Awake but tired. I need to sleep but my brain ...
3        RT @SewHQ: #Retro bears make perfect gifts and...
4        It’s hard to say whether packing lists are mak...
                               ...                        
19995                A day without sunshine is like night.
19996    Boren's Laws: (1) When in charge, ponder. (2) ...
19997    The flow chart is a most thoroughly oversold p...
19998    Ships are safe in harbor, but they were never ...
19999       Black holes are where God is dividing by zero.
Name: post_text, Length: 20000, dtype: object

In [35]:
def clean_text(text):
    """Strip URLs, mentions, hashtags, and special characters from a post."""
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs (http and https)
    text = re.sub(r'@\w+', '', text)  # remove mentions
    text = re.sub(r'#\w+', '', text)  # remove hashtags
    text = re.sub(r'\W', ' ', text)  # replace special characters with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse extra spaces
    return text.strip()

# apply the function to your dataframe and save the result in a new column
# (the column name `clean_text` is chosen here for illustration)
df['clean_text'] = df['post_text'].apply(clean_text)


## 5. Analysis 1 (Sentiment Analysis)

Choose one analysis from (1) Sentiment Analysis, (2) N-grams and Phrase Analysis, (3) Collocation Analysis, (4) Part-of-Speech Tagging, (5) Named Entity Recognition, and (6) Dependency Parsing.

Perform the analysis and save the results into a new column.

In [38]:
# TextBlob-based labeler: maps polarity (from -1 to 1) to a coarse label.
# Note: it is defined here but not applied; the next cell scores posts with VADER.
def get_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'

In [41]:
sia = SentimentIntensityAnalyzer()
df['sent_an'] = df['post_text'].apply(lambda x: sia.polarity_scores(x))
df['sent_an']

0        {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
1        {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
2        {'neg': 0.259, 'neu': 0.741, 'pos': 0.0, 'comp...
3        {'neg': 0.0, 'neu': 0.715, 'pos': 0.285, 'comp...
4        {'neg': 0.06, 'neu': 0.819, 'pos': 0.121, 'com...
                               ...                        
19995    {'neg': 0.542, 'neu': 0.458, 'pos': 0.0, 'comp...
19996    {'neg': 0.257, 'neu': 0.743, 'pos': 0.0, 'comp...
19997    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
19998    {'neg': 0.0, 'neu': 0.86, 'pos': 0.14, 'compou...
19999    {'neg': 0.0, 'neu': 0.792, 'pos': 0.208, 'comp...
Name: sent_an, Length: 20000, dtype: object
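
The `sent_an` column holds one dictionary per post, which is awkward to summarize directly. A minimal sketch of turning each dictionary into a single label from its `compound` score follows, assuming the conventional VADER cut-offs of ±0.05. The function name `label_from_compound`, the column name `sent_label`, and the example scores are made up for illustration.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER score dict to 'positive', 'neutral', or 'negative'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up score dictionaries in the same shape as sia.polarity_scores output.
examples = [
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.3, "neu": 0.7, "pos": 0.0, "compound": -0.4},
    {"neg": 0.0, "neu": 0.7, "pos": 0.3, "compound": 0.6},
]
print([label_from_compound(scores) for scores in examples])

# In the notebook (hypothetical column name):
# df['sent_label'] = df['sent_an'].apply(label_from_compound)
```

A single label column can then be counted with `value_counts()` or compared against the `label` column.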

## 6. Analysis 2 (N-grams)

Choose another analysis from (1) Sentiment Analysis, (2) N-grams and Phrase Analysis, (3) Collocation Analysis, (4) Part-of-Speech Tagging, (5) Named Entity Recognition, and (6) Dependency Parsing.

Perform the analysis and save the results into a new column.

In [40]:
# write your code here
from nltk import ngrams

def generate_ngrams(text, n):
    tokens = text.split()
    return list(ngrams(tokens, n))

In [44]:
df['ngram'] = df['post_text'].apply(lambda x: generate_ngrams(x, n=2))
df['ngram']

0        [(It's, just), (just, over), (over, 2), (2, ye...
1        [(It's, Sunday,), (Sunday,, I), (I, need), (ne...
2        [(Awake, but), (but, tired.), (tired., I), (I,...
3        [(RT, @SewHQ:), (@SewHQ:, #Retro), (#Retro, be...
4        [(It’s, hard), (hard, to), (to, say), (say, wh...
                               ...                        
19995    [(A, day), (day, without), (without, sunshine)...
19996    [(Boren's, Laws:), (Laws:, (1)), ((1), When), ...
19997    [(The, flow), (flow, chart), (chart, is), (is,...
19998    [(Ships, are), (are, safe), (safe, in), (in, h...
19999    [(Black, holes), (holes, are), (are, where), (...
Name: ngram, Length: 20000, dtype: object
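
Per-post bigram lists are most useful once they are aggregated. Here is a minimal sketch of counting the most frequent bigrams across several posts, using `zip` and `collections.Counter` in place of `nltk.ngrams` so that it runs with the standard library alone. The helper names and the three sample posts are assumptions for illustration.

```python
from collections import Counter


def bigrams(text):
    """Return adjacent word pairs from whitespace-split, lowercased text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def top_bigrams(texts, k=3):
    """Count bigrams over all texts and return the k most common."""
    counts = Counter()
    for text in texts:
        counts.update(bigrams(text))
    return counts.most_common(k)


posts = ["I need a break", "I need to sleep", "Awake but tired"]
print(top_bigrams(posts, k=1))
```

With the notebook's existing column, the same count could be built as `Counter(pair for row in df['ngram'] for pair in row)`. Running it on cleaned text rather than raw `post_text` would keep punctuation and `@`/`#` tokens out of the pairs.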

## 7. Push Your Results to GitHub

As you did in previous weeks:
1. `git status`
2. `git add`
3. `git commit -m "type your message here"`
4. `git push`

If you can't push it to GitHub, it's okay to upload it manually.