# Exploratory Data Analysis on Social Media Data

by Andrew Chang, Chloe Din-Luong

The data is uncleaned and was collected using the Twitter API. The Tweets have been filtered to keep only English-language content. The dataset targets mental health classification of users at the Tweet level. This project demonstrates the data cleaning needed to begin an Exploratory Data Analysis on social media data.

## 1. Import necessary packages

In [48]:
import pandas as pd
import emoji
import re
from textblob import TextBlob
from nltk import ngrams

## 2. Read the data

The data is stored in `tweets.csv` in the same folder. For more information about the data, see [here](https://www.kaggle.com/datasets/infamouscoder/mental-health-social-media).

The main column you will be working with is `post_text`.

In [17]:
df = pd.read_csv('tweets.csv')

# explore the data's characteristics with `df.info()`, `df.describe()`, and `df.head()`
df.info()
df.describe()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    20000 non-null  int64 
 1   post_id       20000 non-null  int64 
 2   post_created  20000 non-null  object
 3   post_text     20000 non-null  object
 4   user_id       20000 non-null  int64 
 5   followers     20000 non-null  int64 
 6   friends       20000 non-null  int64 
 7   favourites    20000 non-null  int64 
 8   statuses      20000 non-null  int64 
 9   retweets      20000 non-null  int64 
 10  label         20000 non-null  int64 
dtypes: int64(9), object(2)
memory usage: 1.7+ MB


| | Unnamed: 0 | post_id | post_created | post_text | user_id | followers | friends | favourites | statuses | retweets | label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 637894677824413696 | Sun Aug 30 07:48:37 +0000 2015 | It's just over 2 years since I was diagnosed w... | 1013187241 | 84 | 211 | 251 | 837 | 0 | 1 |
| 1 | 1 | 637890384576778240 | Sun Aug 30 07:31:33 +0000 2015 | It's Sunday, I need a break, so I'm planning t... | 1013187241 | 84 | 211 | 251 | 837 | 1 | 1 |
| 2 | 2 | 637749345908051968 | Sat Aug 29 22:11:07 +0000 2015 | Awake but tired. I need to sleep but my brain ... | 1013187241 | 84 | 211 | 251 | 837 | 0 | 1 |
| 3 | 3 | 637696421077123073 | Sat Aug 29 18:40:49 +0000 2015 | RT @SewHQ: #Retro bears make perfect gifts and... | 1013187241 | 84 | 211 | 251 | 837 | 2 | 1 |
| 4 | 4 | 637696327485366272 | Sat Aug 29 18:40:26 +0000 2015 | It’s hard to say whether packing lists are mak... | 1013187241 | 84 | 211 | 251 | 837 | 1 | 1 |
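
The `post_created` column is read as `object` (plain strings), so time-based exploration needs it parsed first. Below is a minimal sketch using only the standard library. The format string is inferred from the sample rows shown above and is an assumption about the whole column, and `TWITTER_TIME_FORMAT` is a name introduced here for illustration.

```python
from datetime import datetime

# Format inferred from the sample rows (assumption: every row follows it)
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Parse one timestamp copied from the first row of the preview
created = datetime.strptime("Sun Aug 30 07:48:37 +0000 2015", TWITTER_TIME_FORMAT)

print(created.isoformat())  # timezone-aware timestamp
print(created.hour)         # hour of day, handy for time-of-day patterns
```

The same format string could likely be passed to `pd.to_datetime(df['post_created'], format=...)` to convert the whole column at once.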


## 3. Extract emojis

In [30]:
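# define the function
# character class covering the emoticon, pictograph, transport, and flag (regional indicator) blocks
EMOJI_PATTERN = re.compile('[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]')

def extract_emoji(post_text):
    return ''.join(EMOJI_PATTERN.findall(post_text))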

# apply the function to your dataframe
df['emojis'] = df['post_text'].apply(extract_emoji)
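
Once each tweet has a string of extracted emojis, a natural next step is to summarize how often emojis appear. The sketch below uses a small hypothetical list, `sample_emojis`, standing in for `df['emojis']`. It is not taken from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for df['emojis']: one string of extracted emojis per tweet
sample_emojis = ["\U0001F600\U0001F600", "", "\U0001F680", "\U0001F600\U0001F327"]

# Share of tweets that contain at least one emoji
share_with_emoji = sum(1 for s in sample_emojis if s) / len(sample_emojis)

# Frequency of each individual emoji character across all tweets
emoji_counts = Counter(ch for s in sample_emojis for ch in s)

print(share_with_emoji)
print(emoji_counts.most_common(3))
```

Counting by single character splits flag emojis into their two regional-indicator symbols, so flags would need separate handling if they matter for the analysis.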

## 4. Text Cleaning using Regular Expressions 

In [38]:
# define the function
def clean_text(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)  # remove mentions
    text = re.sub(r'#\w+', '', text)  # remove hashtags (must run before '#' is stripped below)
    text = re.sub(r'\W', ' ', text)  # replace special characters with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse extra spaces
    return text.strip()

# apply the function to your dataframe
df['cleaned_text'] = df["post_text"].apply(clean_text)
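
The order of the substitutions matters when cleaning with regular expressions. If non-word characters are stripped before the hashtag pattern runs, the `#` is already gone and the tag word stays in the text. A minimal sketch on a made-up tweet (the `tweet` string is hypothetical):

```python
import re

tweet = "Feeling ok today #selfcare"

# Stripping non-word characters first removes '#', so the hashtag pattern no longer matches
stripped_first = re.sub(r'#\w+', '', re.sub(r'\W', ' ', tweet))

# Removing hashtags first drops the whole tag before other symbols are stripped
hashtag_first = re.sub(r'\W', ' ', re.sub(r'#\w+', '', tweet))

print(stripped_first.split())
print(hashtag_first.split())
```

Running a few hand-written tweets through the cleaning function this way is a cheap check before applying it to the whole column.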

## 5. Sentiment Analysis 

In [47]:
def get_sentiment(cleaned_text):
    # polarity is a float in [-1, 1]; analyze the tweet text itself
    polarity = TextBlob(cleaned_text).sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'

# apply the function to your dataframe
df["sentiment"] = df["cleaned_text"].apply(get_sentiment)
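
TextBlob reports polarity as a float between -1 and 1, so a strict comparison treats only an exact 0.0 as neutral. One possible variation, sketched here, is a small neutral band around zero. `label_polarity` and `threshold` are hypothetical names, and 0.05 is an arbitrary illustrative value rather than a choice made in this analysis.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a sentiment label.

    Scores within +/- threshold of zero are treated as neutral
    (threshold is an illustrative assumption, not a tuned value).
    """
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'

# A few hand-picked scores to show the three outcomes
for score in (0.5, 0.01, -0.3):
    print(score, label_polarity(score))
```

Whether a band is appropriate depends on how many tweets score close to zero, which is worth checking in the data before adopting it.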

## 6. N-Grams and Phrase Analysis

In [None]:
# build the list of n-grams (tuples of n consecutive tokens) for each cleaned tweet
def generate_ngrams(text, n):
    tokens = text.split()
    return list(ngrams(tokens, n))

df["n_grams"] = df["cleaned_text"].apply(lambda x: generate_ngrams(x, n=2))
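
Per-tweet bigram lists become more informative once they are aggregated into corpus-wide counts. The sketch below does this with the standard library only. `sample_texts` is a hypothetical stand-in for `df['cleaned_text']`, and `zip(tokens, tokens[1:])` is used here in place of `nltk.ngrams` to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical stand-in for df['cleaned_text']
sample_texts = ["i need a break", "i need to sleep", "need a break now"]

bigram_counts = Counter()
for text in sample_texts:
    tokens = text.lower().split()
    # consecutive token pairs, the same shape as nltk's ngrams(tokens, 2)
    bigram_counts.update(zip(tokens, tokens[1:]))

# Most frequent phrases across all texts
print(bigram_counts.most_common(3))
```

Lowercasing before counting merges variants such as "I need" and "i need". Tweets with fewer than two tokens simply contribute no bigrams.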