# Week 4 Exercise (group): Exploratory Data Analysis on Social Media Data

- Kristina Dyrvik
- ...

## 1. Import necessary packages

In [23]:
# Install third-party packages first, then import everything in one place
!pip install nltk
!pip install emoji

import re

import emoji
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import seaborn as sns
from nltk import ngrams



In [24]:
!pip install openpyxl



## 2. Read the data

The data is in `tweets.csv`, in the same folder as this notebook. For more information about the data, see [the Kaggle dataset page](https://www.kaggle.com/datasets/infamouscoder/mental-health-social-media).

The main column you will be working with is `post_text`.

In [30]:
df = pd.read_csv("tweets.csv")
df.describe()

| | Unnamed: 0 | post_id | user_id | followers | friends | favourites | statuses | retweets | label |
|---|---|---|---|---|---|---|---|---|---|
| count | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 |
| mean | 9999.5 | 6.874728e+17 | 3.548623e+16 | 900.48395 | 782.42875 | 6398.23555 | 44394.42 | 1437.9273 | 0.5 |
| std | 5773.647028 | 1.708396e+17 | 1.606083e+17 | 1899.913961 | 1834.817945 | 8393.072914 | 140778.5 | 15119.665118 | 0.500013 |
| min | 0.0 | 3555966000.0 | 14724380.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 4999.75 | 5.931686e+17 | 324294400.0 | 177.0 | 211.0 | 243.0 | 5129.0 | 0.0 | 0.0 |
| 50% | 9999.5 | 7.6374e+17 | 1052122000.0 | 476.0 | 561.0 | 2752.0 | 13251.0 | 0.0 | 0.5 |
| 75% | 14999.25 | 8.153124e+17 | 2285923000.0 | 1197.0 | 701.0 | 8229.0 | 52892.0 | 1.0 | 1.0 |
| max | 19999.0 | 8.194574e+17 | 7.631825e+17 | 28614.0 | 28514.0 | 39008.0 | 1063601.0 | 839540.0 | 1.0 |


In [31]:
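# info() does not take a list of columns (its first argument is `verbose`),
# so call it without arguments to summarize every column, including post_text
df.info()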

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    20000 non-null  int64 
 1   post_id       20000 non-null  int64 
 2   post_created  20000 non-null  object
 3   post_text     20000 non-null  object
 4   user_id       20000 non-null  int64 
 5   followers     20000 non-null  int64 
 6   friends       20000 non-null  int64 
 7   favourites    20000 non-null  int64 
 8   statuses      20000 non-null  int64 
 9   retweets      20000 non-null  int64 
 10  label         20000 non-null  int64 
dtypes: int64(9), object(2)
memory usage: 1.7+ MB


## 3. Extract emojis

Use the `emoji` package to extract emojis and put them into a new column called `emojis`.

In [52]:
# Compile the pattern once; these ranges cover common emoji blocks, not all emojis
EMOJI_PATTERN = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)

def extract_emojis(text):
    """Return every run of consecutive emojis found in `text`."""
    return EMOJI_PATTERN.findall(text)

In [54]:
# store the result in a new column called `emojis`, as the task asks
df['emojis'] = df['post_text'].apply(extract_emojis)
df['emojis'].head()

0    []
1    []
2    []
3    []
4    []
Name: emojis, dtype: object
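
Because the first five rows all come back empty, it is worth checking that the extractor itself works before concluding the tweets contain no emojis. The following is a minimal sketch of such a sanity check, using the same four Unicode ranges as above; the sample strings are made up for illustration and are not taken from `tweets.csv`.

```python
import re

# Same four Unicode ranges as the extractor above
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)

def extract_emojis(text):
    """Return every run of consecutive emojis found in `text`."""
    return EMOJI_PATTERN.findall(text)

# Hypothetical sample strings (not from the dataset)
samples = [
    "no emojis here",
    "so tired \U0001F634\U0001F634 need coffee \U0001F600",  # sleeping face x2, grinning face
]
results = [extract_emojis(s) for s in samples]

# U+2764 (heavy black heart) lies outside the four ranges, so it is not matched
missed = extract_emojis("love \u2764")

print(results)
print(missed)
```

Two things stand out in this sketch: adjacent emojis are returned as a single string because of the `+` quantifier, and characters outside the listed ranges are silently skipped. Both are reasons the `emoji` package may be a better fit for the task as stated.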

## 4. Text Cleaning using Regular Expressions 

1. Remove URLs
2. Remove mentions
3. Remove hashtags
4. Remove special characters
5. Remove extra space

Code can be found in [week 6 lecture 1](https://github.com/yibeichan/COMM160DS/blob/main/week_6/lecture_part1.ipynb) section `4.4 All-in-One`

Perform the analysis and save the results into a new column.

In [57]:
# define the function
def clean_tweets(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs (dot escaped so it matches a literal ".")
    text = re.sub(r'#\w+', '', text)  # remove hashtags
    text = re.sub(r'@\w+', '', text)  # remove mentions
    text = re.sub(r'\W', ' ', text)  # replace special characters (including apostrophes) with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse extra spaces
    return text.strip()

# apply the function to your dataframe
df['cleaned_tweets'] = df['post_text'].apply(clean_tweets)

In [61]:
df['cleaned_tweets'].head()

0    It s just over 2 years since I was diagnosed w...
1    It s Sunday I need a break so I m planning to ...
2    Awake but tired I need to sleep but my brain h...
3    RT bears make perfect gifts and are great for ...
4    It s hard to say whether packing lists are mak...
Name: cleaned_tweets, dtype: object
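
To see what each cleaning step does, the function can be run on a single made-up tweet. The sketch below is a self-contained copy of the cleaning steps, written with the dot in `www\.` escaped; the sample strings are hypothetical and only meant to illustrate the behavior.

```python
import re

def clean_tweets(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'#\w+', '', text)                   # remove hashtags
    text = re.sub(r'@\w+', '', text)                   # remove mentions
    text = re.sub(r'\W', ' ', text)                    # special characters -> spaces
    text = re.sub(r'\s+', ' ', text)                   # collapse whitespace
    return text.strip()

# Hypothetical tweet containing a mention, a URL, a hashtag and punctuation
sample = "RT @user: Check https://example.com #fun it's great!!"
cleaned = clean_tweets(sample)
print(cleaned)

# An unescaped dot matches any character, so ordinary words containing "www"
# can be mistaken for URLs
unescaped = re.sub(r'www.\S+', '', "awww so cute")
escaped = clean_tweets("awww so cute")
print(repr(unescaped), repr(escaped))
```

Note that `\W` also replaces apostrophes, which is why contractions such as "It's" show up as "It s" in the cleaned column.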

## 5. Part-of-Speech Tagging

Choose one analysis from (1) Sentiment Analysis, (2) N-grams and Phrase Analysis, (3) Collocation Analysis, (4) Part-of-Speech Tagging, (5) Named Entity Recognition, and (6) Dependency Parsing.

Perform the analysis and save the results into a new column.

In [None]:
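nltk.download('punkt')  # tokenizer models needed by nltk.word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model needed by nltk.pos_tag
# Newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead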

In [None]:
def pos_tagging_tweets(text):
    tokens = nltk.word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

df['tagged pos tweets'] = df['post_text'].apply(pos_tagging_tweets)
df['tagged pos tweets'].head()
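
Each cell of the new column holds a list of `(token, tag)` pairs. One simple way to explore such a list is to count the tags and pull out the nouns. The sketch below assumes Penn Treebank style tags, which is what `nltk.pos_tag` returns by default; the tagged tokens are written by hand for illustration and are not actual tagger output.

```python
from collections import Counter

# Hand-written (token, tag) pairs in the shape nltk.pos_tag returns;
# the tags here are illustrative, not produced by the tagger
tagged = [
    ("I", "PRP"),
    ("need", "VBP"),
    ("a", "DT"),
    ("break", "NN"),
    ("and", "CC"),
    ("sleep", "NN"),
]

# How often each part-of-speech tag occurs in this one tweet
tag_counts = Counter(tag for _, tag in tagged)

# Noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [token for token, tag in tagged if tag.startswith("NN")]

print(tag_counts)
print(nouns)
```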

## 6. N-grams and Phrase Analysis

Choose another analysis from (1) Sentiment Analysis, (2) N-grams and Phrase Analysis, (3) Collocation Analysis, (4) Part-of-Speech Tagging, (5) Named Entity Recognition, and (6) Dependency Parsing.

Perform the analysis and save the results into a new column.

In [64]:
from nltk import ngrams

def generate_ngrams(text, n):
    tokens = text.split()
    return list(ngrams(tokens, n))

df['n_grams'] = df['post_text'].apply(lambda x: generate_ngrams(x, n=2))
df['n_grams'].head()

0    [(It's, just), (just, over), (over, 2), (2, ye...
1    [(It's, Sunday,), (Sunday,, I), (I, need), (ne...
2    [(Awake, but), (but, tired.), (tired., I), (I,...
3    [(RT, @SewHQ:), (@SewHQ:, #Retro), (#Retro, be...
4    [(It’s, hard), (hard, to), (to, say), (say, wh...
Name: n_grams, dtype: object

In [67]:
from collections import Counter

df['n_gram_counts'] = df['n_grams'].apply(Counter)
df['n_gram_counts'].head()

0    {('It's', 'just'): 1, ('just', 'over'): 1, ('o...
1    {('It's', 'Sunday,'): 1, ('Sunday,', 'I'): 1, ...
2    {('Awake', 'but'): 1, ('but', 'tired.'): 1, ('...
3    {('RT', '@SewHQ:'): 1, ('@SewHQ:', '#Retro'): ...
4    {('It’s', 'hard'): 1, ('hard', 'to'): 1, ('to'...
Name: n_gram_counts, dtype: object
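
Counting bigrams within a single tweet mostly yields counts of 1, as the output above shows. Counts become more informative when they are accumulated across all tweets. The following is a minimal sketch of that idea using only the standard library: `bigrams` is a hypothetical helper that pairs neighboring tokens with `zip`, and the three example tweets are made up.

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with its right-hand neighbor."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical, already-cleaned tweets (not from the dataset)
tweets = [
    "i need a break",
    "i need to sleep",
    "need a break now",
]

# One Counter for the whole collection instead of one per tweet
total = Counter()
for tweet in tweets:
    total.update(bigrams(tweet.split()))

# The three most frequent bigrams across all tweets
top3 = total.most_common(3)
print(top3)
```

With the notebook's DataFrame, the same accumulation could be done by looping over `df['n_grams']` and calling `total.update(...)` on each row.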

## 7. Push Your Results to GitHub

As you did in previous weeks:
1. `git status`
2. `git add`
3. `git commit -m "type your message here"`
4. `git push`

If you can't push it to GitHub, it's okay to upload it manually.