# Week 4 Exercise (group): Exploratory Data Analysis on Social Media Data

- Vicky Xie
- Riley Smith
- Finnley O'Rouke
- Kayla Katakis

## 1. Import necessary packages

In [12]:
import pandas as pd

## 2. Read the data

The data is in `tweets.csv`, located in the same folder. For more information about the data, see the [Kaggle dataset page](https://www.kaggle.com/datasets/infamouscoder/mental-health-social-media).

The main column you will be working with is `post_text`

In [13]:
df = pd.read_csv("tweets.csv")
# explore the data characteristics using `df.describe()` or `df.info()`
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    20000 non-null  int64 
 1   post_id       20000 non-null  int64 
 2   post_created  20000 non-null  object
 3   post_text     20000 non-null  object
 4   user_id       20000 non-null  int64 
 5   followers     20000 non-null  int64 
 6   friends       20000 non-null  int64 
 7   favourites    20000 non-null  int64 
 8   statuses      20000 non-null  int64 
 9   retweets      20000 non-null  int64 
 10  label         20000 non-null  int64 
dtypes: int64(9), object(2)
memory usage: 1.7+ MB


## 3. Extract emojis

Use the `emoji` package to extract emojis and put them into a new column called `emojis`.

In [14]:
!pip install emoji
import emoji
import re

Collecting emoji
  Using cached emoji-2.2.0-py3-none-any.whl
Installing collected packages: emoji
Successfully installed emoji-2.2.0


In [15]:
# define the function

def extract_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.findall(text)

# apply the function to your dataframe
df['emojis'] = df['post_text'].apply(extract_emojis)
df['emojis'].head()

0    []
1    []
2    []
3    []
4    []
Name: emojis, dtype: object
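
The first five rows come back empty. That may simply mean those tweets contain no emojis, but the four ranges in the pattern above also leave out several emoji blocks. A minimal sketch of a wider pattern follows; the name `extract_emojis_extended` and the sample string are hypothetical, and the added ranges are illustrative rather than exhaustive.

```python
import re

# Hypothetical extension of the pattern above. The extra ranges are
# illustrative, not a complete list of emoji code points.
EXTENDED_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]+"
)

def extract_emojis_extended(text):
    # Return every run of characters that falls in the ranges above
    return EXTENDED_EMOJI_PATTERN.findall(text)

# Made-up sample: U+1F914 (thinking face) and U+2764 (heavy heart)
# are both outside the original four ranges.
sample = "thinking \U0001F914 about it \u2764 today"
found = extract_emojis_extended(sample)
```

Because the pattern ends in `+`, adjacent emojis come back as one string. The `emoji` package is another option: version 2.x provides `emoji.emoji_list(text)`, which reports each emoji separately, though the exact return format is worth checking in the package documentation.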

## 4. Text Cleaning using Regular Expressions 

1. Remove URLs
2. Remove mentions
3. Remove hashtags
4. Remove special characters
5. Remove extra space

Code can be found in [week 6 lecture 1](https://github.com/yibeichan/COMM160DS/blob/main/week_6/lecture_part1.ipynb) section `4.4 All-in-One`

Perform the analysis and save the results into a new column.

In [56]:
# define the function
def clean_text(text):
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)  # remove mentions
    text = re.sub(r'#\w+', '', text)  # remove hashtags
    text = re.sub(r'\W', ' ', text)  # remove special characters
    text = re.sub(r'\s+', ' ', text)  # remove extra spaces
    return text.strip()

# apply the function to your dataframe
df["cleaned_text"] = df["post_text"].apply(clean_text)
df["cleaned_text"].head()

0    It s just over 2 years since I was diagnosed w...
1    It s Sunday I need a break so I m planning to ...
2    Awake but tired I need to sleep but my brain h...
3    RT bears make perfect gifts and are great for ...
4    It s hard to say whether packing lists are mak...
Name: cleaned_text, dtype: object
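
The `It s` in the output comes from the special-character step: `\W` matches the apostrophe in `It's` and replaces it with a space. The sketch below runs the same cleaning steps on a made-up tweet and compares them with a hypothetical variant, `clean_text_keep_apostrophes`, that leaves apostrophes in place. Whether that is preferable depends on the analysis that follows.

```python
import re

def clean_text(text):
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)            # remove mentions
    text = re.sub(r'#\w+', '', text)            # remove hashtags
    text = re.sub(r'\W', ' ', text)             # remove special characters
    text = re.sub(r'\s+', ' ', text)            # collapse extra spaces
    return text.strip()

def clean_text_keep_apostrophes(text):
    # Hypothetical variant: identical steps, except that apostrophes
    # are not treated as special characters.
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    text = re.sub(r"[^\w\s']", ' ', text)       # keep word chars, spaces, apostrophes
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Made-up example tweet
sample = "It's great! @user #tag https://example.com/x"
original = clean_text(sample)
variant = clean_text_keep_apostrophes(sample)
```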

## 5. Analysis 1 (Sentiment Analysis)

Choose one analysis from (1) Sentiment Analysis, (2) N-grams and Phrase Analysis, (3) Collocation Analysis, (4) Part-of-Speech Tagging, (5) Named Entity Recognition, and (6) Dependency Parsing.

Perform the analysis and save the results into a new column.

In [57]:
!pip install textblob



In [58]:
# write your code here

from textblob import TextBlob

def get_sentiment(text):
    # polarity ranges from -1.0 (most negative) to 1.0 (most positive)
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'

sentiment = df["cleaned_text"].apply(get_sentiment)
df["sentiment"] = sentiment  # save the results into a new column
sentiment

0        positive
1        negative
2        negative
3        positive
4        negative
           ...   
19995     neutral
19996    negative
19997    positive
19998    positive
19999    negative
Name: cleaned_text, Length: 20000, dtype: object
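
`get_sentiment` treats only a polarity of exactly 0 as neutral, so a score such as 0.01 counts as positive. One possible refinement, shown here as a sketch and not as part of the exercise, is a neutral band around zero. The function `label_polarity` and the `threshold` value of 0.05 are assumptions chosen for illustration.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1.0, 1.0] to a label, with a neutral band."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Made-up polarity scores to show where the band applies
labels = [label_polarity(p) for p in (0.5, 0.01, 0.0, -0.3)]
```

With TextBlob, this could be called as `label_polarity(TextBlob(text).sentiment.polarity)`.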

## 6. Analysis 2 (N-grams and Phrase Analysis)

Choose another analysis from (1) Sentiment Analysis, (2) N-grams and Phrase Analysis, (3) Collocation Analysis, (4) Part-of-Speech Tagging, (5) Named Entity Recognition, and (6) Dependency Parsing.

Perform the analysis and save the results into a new column.

In [66]:
# write your code here
from nltk import ngrams

def generate_ngrams(text, n):
    tokens = text.split()
    return list(ngrams(tokens, n))

In [67]:
df["n_grams"] = df["cleaned_text"].apply(lambda x: generate_ngrams(x, n=2))

In [68]:
df["n_grams"]

0        [(It, s), (s, just), (just, over), (over, 2), ...
1        [(It, s), (s, Sunday), (Sunday, I), (I, need),...
2        [(Awake, but), (but, tired), (tired, I), (I, n...
3        [(RT, bears), (bears, make), (make, perfect), ...
4        [(It, s), (s, hard), (hard, to), (to, say), (s...
                               ...                        
19995    [(A, day), (day, without), (without, sunshine)...
19996    [(Boren, s), (s, Laws), (Laws, 1), (1, When), ...
19997    [(The, flow), (flow, chart), (chart, is), (is,...
19998    [(Ships, are), (are, safe), (safe, in), (in, h...
19999    [(Black, holes), (holes, are), (are, where), (...
Name: n_grams, Length: 20000, dtype: object
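
A column of bigram lists is hard to read on its own; counting bigrams across all rows shows which phrases recur. The sketch below uses only the standard library, with a small `make_ngrams` helper standing in for `nltk.ngrams` and two made-up texts standing in for `df["cleaned_text"]`. Both names are hypothetical.

```python
from collections import Counter

def make_ngrams(tokens, n):
    # Stand-in for nltk.ngrams: slide a window of size n over the tokens
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up texts in place of the cleaned tweets
texts = ["I need a break", "I need to sleep"]

bigram_counts = Counter()
for t in texts:
    bigram_counts.update(make_ngrams(t.split(), 2))

# Most frequent bigram and its count
top = bigram_counts.most_common(1)
```

On the real data, the same counter could be built from the existing column, for example `Counter(bg for row in df["n_grams"] for bg in row)`.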

## 7. Push Your Results to GitHub

As you did in previous weeks:
1. `git status`
2. `git add`
3. `git commit -m "type your message here"`
4. `git push`

If you can't push it to GitHub, it's okay to upload it manually.