# Project 3 
The objective of this project is to perform word frequency analysis. The link below provides Elon Musk's Twitter data from 2010-2022. For the analysis, consider the years `2017-2021` (the last 5 complete years). Each year has thousands of tweets. Treat each year as one document (all the tweets in a year are considered a single document).
1. Compute the term frequencies for each year. They should be normalized (scale of [0, 1]). Exclude stopwords. 
2. Show the top 10 words (for each year) by highest value of word frequency. 
3. Plot a histogram of word frequencies for each year 
4. Demonstrate Zipf’s law by plotting log-log plots of word frequencies v. rank for each year 
5. Use TF-IDF to calculate and show the 5 most “important” words for each year 
https://www.kaggle.com/datasets/ayhmrba/elon-musk-tweets-2010-2021?resource=download&select=2017.csv

# Data Description
## About Dataset
Elon Musk Tweets (2010 - 2021)
All Elon Musk Twitter Tweets, from 2010 to March 22, 2021.
23/3/2021
Elon Reeve Musk FRS is a business magnate, industrial designer, and engineer. He is the founder, CEO, CTO, and chief designer of SpaceX; early investor, CEO, and product architect of Tesla, Inc.; founder of The Boring Company; co-founder of Neuralink; and co-founder and initial co-chairman of OpenAI. - Wikipedia
Although Elon joined Twitter in 2009, **he didn't start tweeting until 2010, or his earlier tweets were deleted.**

## Column Descriptions:
#: Index.
            `id: ID of tweet`.
            conversation_id: ID of twitter conversation/thread.
            created_at: Unknown, some kind of time/location index from twitter. (?)
            `date: Date of Creation`.
            timezone: Timezone.
            place: Location.
            `tweet: Contents of tweet, tweet body.`
            `language: Language of tweet.`
            hashtags: Hashtags in the tweet "#".
            cashtags: Cashtags in the tweet "$", often used for stock tweets.
            `user_id: ID of the tweet/reply author.`
            `user_id_str: User ID but in string format.`
            `username: Username of the tweet/reply author.`
            name: Name of tweet/reply author.
            day: Day of the week in which the tweet was published.
            hour: Hour of the day in which the tweet was published.
            link: Link to the tweet.
            urls: Urls present in the tweet.
            photos: Photos in the tweet (as links).
            video: videos in the tweet (Yes/No).
            thumbnail: Thumbnail for the image present in the tweet (if applicable, otherwise null).
            `retweet: Is this a retweet? (Yes/No).`
            nlikes: Number of likes on the tweet.
            nreplies: Number of replies to the tweet.
            nretweets: Number of times the tweet was retweeted.
            quote_url: Url of quoted tweet, if applicable.
            search: Unknown.
            near: Additional location info, null.
            geo: Additional location info, null.
            source: Unknown, null.
            user_rt_id: Possibly the id of the tweet author if it's a retweet, null.
            user_rt: Possibly the username of the tweet author if it's a retweet, null.
            retweet_id: Id of the retweet, null.
            reply_to: Info about the original tweet if this datapoint is a reply.
            retweet_date: Date of retweet, null.
            translate, trans_src, trans_dest: Columns related to the Google Translate API, which was not used, so these columns are all null.

Elon Musk: https://en.wikipedia.org/wiki/Elon_Musk
Twitter: https://twitter.com
Elon Musk on Twitter: https://twitter.com/elonmusk

# Pre-work

In [112]:
# import libs
# basic libs  
import pandas as pd
import numpy as np
import math
import os

# visualization libs
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator


import glob
from pylab import *

# models libs
import re
import nltk
from nltk.corpus import stopwords

# stopwords (the import from nltk.corpus is already done above)
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/audrey/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [106]:
#! pip install WordCloud



In [86]:
os.getcwd()

'/Users/audrey/Documents/0_NEU/IE6400/project3'

In [87]:
# read data from year 2017-2021
df2017 = pd.read_csv("archive/2017.csv")
df2018 = pd.read_csv("archive/2018.csv")
df2019 = pd.read_csv("archive/2019.csv")
df2020 = pd.read_csv("archive/2020.csv")
df2021 = pd.read_csv("archive/2021.csv")


# Understand and process data

In [88]:
df2017.head()

Unnamed: 0.1,Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,0,945814723521417217,945712432416137217,1514335000000.0,2017-12-27 00:32:57,0,,@neilsiegel @Tesla Coming very soon,en,[],...,,,,,,"[{'screen_name': 'neilsiegel', 'name': 'Neil S...",,,,
1,1,945749747129659392,945712432416137217,1514319000000.0,2017-12-26 20:14:45,0,,@Kreative Vastly better maps/nav coming soon,en,[],...,,,,,,"[{'screen_name': 'Kreative', 'name': 'Leslie',...",,,,
2,2,945748731197980672,945712432416137217,1514319000000.0,2017-12-26 20:10:43,0,,@dd_hogan Ok,und,[],...,,,,,,"[{'screen_name': 'dd_hogan', 'name': 'Live4EVD...",,,,
3,3,945730195113365504,945727773493968896,1514315000000.0,2017-12-26 18:57:03,0,,@Jason @Tesla Sure,en,[],...,,,,,,"[{'screen_name': 'Jason', 'name': 'jason@calac...",,,,
4,4,945729852874694656,945712432416137217,1514315000000.0,2017-12-26 18:55:42,0,,"@kabirakhtar Yeah, it’s terrible. Had to upgra...",en,[],...,,,,,,"[{'screen_name': 'kabirakhtar', 'name': 'kabir...",,,,


In [89]:
df2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3167 entries, 0 to 3166
Data columns (total 39 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       3167 non-null   int64  
 1   id               3167 non-null   int64  
 2   conversation_id  3167 non-null   int64  
 3   created_at       3167 non-null   float64
 4   date             3167 non-null   object 
 5   timezone         3167 non-null   int64  
 6   place            0 non-null      float64
 7   tweet            3167 non-null   object 
 8   language         3167 non-null   object 
 9   hashtags         3167 non-null   object 
 10  cashtags         3167 non-null   object 
 11  user_id          3167 non-null   int64  
 12  user_id_str      3167 non-null   int64  
 13  username         3167 non-null   object 
 14  name             3167 non-null   object 
 15  day              3167 non-null   int64  
 16  hour             3167 non-null   int64  
 17  link          

In [90]:
df2017.describe()

Unnamed: 0.1,Unnamed: 0,id,conversation_id,created_at,timezone,place,user_id,user_id_str,day,hour,...,near,geo,source,user_rt_id,user_rt,retweet_id,retweet_date,translate,trans_src,trans_dest
count,3167.0,3167.0,3167.0,3167.0,3167.0,0.0,3167.0,3167.0,3167.0,3167.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,1583.0,6.590143e+17,6.587789e+17,1445952000000.0,0.0,,44196397.0,44196397.0,4.233344,12.504263,...,,,,,,,,,,
std,914.378477,2.379594e+17,2.378503e+17,56745930000.0,0.0,,0.0,0.0,1.954356,7.701672,...,,,,,,,,,,
min,0.0,15434730000.0,15434730000.0,1275676000000.0,0.0,,44196397.0,44196397.0,1.0,0.0,...,,,,,,,,,,
25%,791.5,4.572019e+17,4.572013e+17,1397840000000.0,0.0,,44196397.0,44196397.0,3.0,5.0,...,,,,,,,,,,
50%,1583.0,7.430977e+17,7.430971e+17,1466003000000.0,0.0,,44196397.0,44196397.0,4.0,15.0,...,,,,,,,,,,
75%,2374.5,8.622148e+17,8.621445e+17,1494403000000.0,0.0,,44196397.0,44196397.0,6.0,19.0,...,,,,,,,,,,
max,3166.0,9.458147e+17,9.457278e+17,1514335000000.0,0.0,,44196397.0,44196397.0,7.0,23.0,...,,,,,,,,,,


In [91]:
# understand data

# check that they are all from Musk's Twitter account
# username: Username of the tweet/reply author.
print("num of username: ", len(unique(df2017["username"])), unique(df2017["username"]))
# name: Name of tweet/reply author.
print("num of name: ", len(unique(df2017["name"])), unique(df2017["name"]))


# id: ID of tweet. 
print("num of ID/tweet: ", len(unique(df2017["id"])))

num of username:  1 ['elonmusk']
num of name:  1 ['Elon Musk']
num of ID/tweet:  3167


In [92]:
# check "date", date: Date of Creation
# the output shows that some single-year files contain tweets from other years.
print("df2017:", len(df2017), unique(pd.to_datetime(df2017["date"]).dt.year))
print("df2018:", len(df2018), unique(pd.to_datetime(df2018["date"]).dt.year))
print("df2019:", len(df2019), unique(pd.to_datetime(df2019["date"]).dt.year))
print("df2020:", len(df2020), unique(pd.to_datetime(df2020["date"]).dt.year))
print("df2021:", len(df2021), unique(pd.to_datetime(df2021["date"]).dt.year))

# the output tells us that a file does not necessarily contain only its own year's data

df2017: 3167 [2010 2011 2012 2013 2014 2015 2016 2017]
df2018: 2285 [2018]
df2019: 8312 [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019]
df2020: 11717 [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020]
df2021: 3115 [2021]


In [175]:
# concat df and process all
df = pd.concat([df2017, df2018, df2019, df2020, df2021])

In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28596 entries, 0 to 3114
Data columns (total 44 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       25481 non-null  float64
 1   id               28596 non-null  int64  
 2   conversation_id  28596 non-null  int64  
 3   created_at       28596 non-null  object 
 4   date             28596 non-null  object 
 5   timezone         28596 non-null  int64  
 6   place            0 non-null      float64
 7   tweet            28596 non-null  object 
 8   language         28596 non-null  object 
 9   hashtags         28596 non-null  object 
 10  cashtags         28596 non-null  object 
 11  user_id          28596 non-null  int64  
 12  user_id_str      25481 non-null  float64
 13  username         28596 non-null  object 
 14  name             28596 non-null  object 
 15  day              25481 non-null  float64
 16  hour             25481 non-null  float64
 17  link         

In [176]:
# convert data type
df["date"] = pd.to_datetime(df["date"])

In [177]:
# the data description says that id is the primary key of a posted tweet, so let's check this:
print("df:", len(df), len(unique(df["id"])), unique(df["date"].dt.year))

df: 28596 14832 [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021]


In [178]:
# so we need to remove duplicates:
df = df.drop_duplicates(subset=['id'])
print("df:", len(df), len(unique(df["id"])), unique(df["date"].dt.year))

df: 14832 14832 [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021]


In [179]:
# then we filter 5 complete years required
df = df.loc[(df["date"].dt.year >= 2017) & (df["date"].dt.year <= 2021)]
print("df:", len(df), unique(df["date"].dt.year))

df: 12826 [2017 2018 2019 2020 2021]


In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12826 entries, 0 to 3114
Data columns (total 44 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Unnamed: 0       9711 non-null   float64       
 1   id               12826 non-null  int64         
 2   conversation_id  12826 non-null  int64         
 3   created_at       12826 non-null  object        
 4   date             12826 non-null  datetime64[ns]
 5   timezone         12826 non-null  int64         
 6   place            0 non-null      float64       
 7   tweet            12826 non-null  object        
 8   language         12826 non-null  object        
 9   hashtags         12826 non-null  object        
 10  cashtags         12826 non-null  object        
 11  user_id          12826 non-null  int64         
 12  user_id_str      9711 non-null   float64       
 13  username         12826 non-null  object        
 14  name             12826 non-null  object

In [180]:
# check again, on the filtered data, that they are all from Musk's Twitter account
# username: Username of the tweet/reply author.
print("num of username: ", len(unique(df["username"])), unique(df["username"]))
# name: Name of tweet/reply author.
print("num of name: ", len(unique(df["name"])), unique(df["name"]))

num of username:  1 ['elonmusk']
num of name:  1 ['Elon Musk']


In [181]:
# see what languages he used
unique(df.language)
df.groupby(["language"])["language"].agg("count").sort_values(ascending=False)

# output: "und" is the ISO 639-3 language code for "Undetermined language"
# import a lib for ISO 639-3 language codes?

language
en     11052
und     1346
tl       132
de        41
fr        39
es        33
in        24
ru        15
pt        11
pl        11
nl        11
it        11
et        10
ca         9
tr         9
da         9
lt         7
cy         6
hi         5
no         5
ro         5
is         4
ht         4
eu         3
sl         3
hu         3
ja         3
sv         3
vi         2
lv         2
fi         2
cs         2
sr         1
el         1
uk         1
ar         1
Name: language, dtype: int64

In [192]:
en = df[df["language"]=="en"]
len(en)

11052

In [147]:
#en.to_csv("en.csv")

In [159]:
# use the "reply_to" column to strip the @screen_name mentions from the "tweet" column, leaving only the text he wrote.
#a = pd.DataFrame.from_dict(pd.DataFrame(df["reply_to"]))

In [193]:
en.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11052 entries, 0 to 3114
Data columns (total 44 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Unnamed: 0       8488 non-null   float64       
 1   id               11052 non-null  int64         
 2   conversation_id  11052 non-null  int64         
 3   created_at       11052 non-null  object        
 4   date             11052 non-null  datetime64[ns]
 5   timezone         11052 non-null  int64         
 6   place            0 non-null      float64       
 7   tweet            11052 non-null  object        
 8   language         11052 non-null  object        
 9   hashtags         11052 non-null  object        
 10  cashtags         11052 non-null  object        
 11  user_id          11052 non-null  int64         
 12  user_id_str      8488 non-null   float64       
 13  username         11052 non-null  object        
 14  name             11052 non-null  object

In [194]:
#a

Unnamed: 0,reply_to
0,"[{'screen_name': 'neilsiegel', 'name': 'Neil S..."
1,"[{'screen_name': 'Kreative', 'name': 'Leslie',..."
2,"[{'screen_name': 'dd_hogan', 'name': 'Live4EVD..."
3,"[{'screen_name': 'Jason', 'name': 'jason@calac..."
4,"[{'screen_name': 'kabirakhtar', 'name': 'kabir..."
...,...
3110,"[{'screen_name': 'flcnhvy', 'name': 'Viv ✶', '..."
3111,[]
3112,"[{'screen_name': 'newscientist', 'name': 'New ..."
3113,"[{'screen_name': 'comma_ai', 'name': 'comma', ..."


In [166]:
#en["reply_to"] = pd.DataFrame(en["reply_to"])

In [195]:
type(en["reply_to"])

pandas.core.series.Series

In [196]:
type(en["reply_to"].values[0])

str

In [197]:
en["reply_to"].values[0]

"[{'screen_name': 'neilsiegel', 'name': 'Neil Siegel', 'id': '255527845'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]"

In [187]:
import json

string1 = "[{'screen_name': 'neilsiegel', 'name': 'Neil Siegel', 'id': '255527845'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]"

# Replace single quotes with double quotes to get a valid JSON string
json_str = string1.replace("'", "\"")

# Parse the JSON string into a Python list of dictionaries
my_dict = json.loads(json_str)

print(my_dict)


[{'screen_name': 'neilsiegel', 'name': 'Neil Siegel', 'id': '255527845'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]


In [190]:
my_dict["screen_name"]

TypeError: list indices must be integers or slices, not str
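
The `TypeError` arises because `json.loads` returned a list of dictionaries rather than a single dictionary, so it has to be indexed by position or iterated. A minimal sketch, reusing the parsed value printed above (the variable names `parsed`, `first_name` and `screen_names` are illustrative only):

```python
# Parsed value from the cell above: a list with one dict per mentioned account
parsed = [
    {'screen_name': 'neilsiegel', 'name': 'Neil Siegel', 'id': '255527845'},
    {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'},
]

# Index by position first, then by key
first_name = parsed[0]["screen_name"]

# Or collect every screen name with a list comprehension
screen_names = [entry["screen_name"] for entry in parsed]
print(first_name, screen_names)
```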

In [124]:
# example 1. it has 1 reply name.
reply_name_list1 = [{'screen_name': 'FredericLambert', 'name': 'Fred Lambert', 'id': '38253449'}]
tweet_full1 = "@FredericLambert Kinda"
tweet_reply_name = "@FredericLambert"

# example 2. it has 1 reply name, but @TeslaMotors.
tweet_full2 = "@TheStaceyRoy @TeslaMotors California Cabernet"
reply_name_list2 = [{'screen_name': 'TheStaceyRoy', 'name': 'Stacey Roy', 'id': '440525434'}]

# example 3. it has 3 reply names.
tweet_full3 = "@waltmossberg @mims @defcon_5 Et tu, Walt?"
reply_name_list3 = [{'screen_name': 'waltmossberg', 'name': 'Walt Mossberg', 'id': '5746452'}, 
 {'screen_name': 'mims', 'name': 'Christopher Mims', 'id': '1769191'}, 
 {'screen_name': 'defcon_5', 'name': 'defcon_5', 'id': '17212941'}]

In [None]:
# Using loop + isinstance(): replace any list-valued field with its first element
for dicts in reply_name_list3:
    for key, val in dicts.items():
        # isinstance() checks whether the value is a list that needs unwrapping
        if isinstance(val, list):
            dicts[key] = val[0]

# print the result
print("The converted dictionary list: " + str(reply_name_list3))

In [126]:
len(reply_name_list3)

3

In [198]:
type(en["reply_to"])

pandas.core.series.Series

In [203]:
df["reply_to"].values[0]

"[{'screen_name': 'neilsiegel', 'name': 'Neil Siegel', 'id': '255527845'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]"

In [206]:
aa = df["reply_to"].values[0].replace("'", "\"")
aa
aa = json.loads(aa)
aa


[{'screen_name': 'neilsiegel', 'name': 'Neil Siegel', 'id': '255527845'},
 {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]

In [209]:
df["reply_to"].values[0]

"[{'screen_name': 'neilsiegel', 'name': 'Neil Siegel', 'id': '255527845'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]"

In [207]:
def remove_reply_name(df: pd.DataFrame, col: str, rect: str):
    # For each row, parse the reply_to string and strip every "@screen_name" from the tweet text.
    for i in range(len(df)):
        # Naive quote swap: this breaks when a name contains an apostrophe or a backslash
        aa = df[col].values[i].replace("'", "\"")
        replies = json.loads(aa)
        for dicts in replies:
            name = "@" + dicts["screen_name"]
            df[rect].values[i] = df[rect].values[i].replace(name, "")
    return df[rect]

#print(tweet_full3.replace(("@" + reply_name_list3[0]["screen_name"]), ''))

In [208]:
remove_reply_name(en, "reply_to", "tweet")

JSONDecodeError: Invalid \escape: line 1 column 57 (char 56)
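
Swapping quote characters only yields valid JSON when no value contains an apostrophe, a double quote, or a backslash, which is the probable cause of the `JSONDecodeError`. Since the `reply_to` strings look like Python literals (an assumption based on the samples printed above), `ast.literal_eval` from the standard library may be a more robust parser. The sketch below uses a hypothetical helper `strip_reply_names` and a made-up sample row:

```python
import ast

def strip_reply_names(tweet: str, reply_to: str) -> str:
    """Remove every '@screen_name' listed in reply_to from the tweet text."""
    # literal_eval parses Python-literal syntax, so single-quoted strings
    # and names containing apostrophes are handled without any quote swapping.
    for entry in ast.literal_eval(reply_to):
        tweet = tweet.replace("@" + entry["screen_name"], "")
    # Collapse the whitespace left behind by the removals
    return " ".join(tweet.split())

# Made-up row: the display name contains an apostrophe, which would break
# the quote-swapping approach.
sample_reply_to = "[{'screen_name': 'example_user', 'name': \"Pat O'Neil\", 'id': '1'}]"
sample_tweet = "@example_user Coming very soon"
print(strip_reply_names(sample_tweet, sample_reply_to))
```

With pandas, the helper could be applied row by row, for example `en.apply(lambda r: strip_reply_names(r["tweet"], r["reply_to"]), axis=1)`. Note that plain string replacement also removes a listed name where it is the prefix of a longer mention.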

In [119]:
type(pd.DataFrame(df["reply_to"]))

pandas.core.frame.DataFrame

In [None]:
b = len(df["reply_to"])

In [113]:
en["tweet"]

0                     @neilsiegel @Tesla Coming very soon
1            @Kreative Vastly better maps/nav coming soon
3                                      @Jason @Tesla Sure
4       @kabirakhtar Yeah, it’s terrible. Had to upgra...
5       @sustainableanna @VanSeedBank Similar total si...
                              ...                        
3110    @flcnhvy Tesla is responsible for 2/3 of all t...
3111    So proud of the Tesla team for achieving this ...
3112    @newscientist Um, we have giant fusion reactor...
3113    @comma_ai Tesla Full Self-Driving will work at...
3114    @PPathole Dojo isn’t needed, but will make sel...
Name: tweet, Length: 11052, dtype: object

In [111]:
wordcloud = WordCloud().generate(en["tweet"])
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

TypeError: expected string or bytes-like object
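
`WordCloud.generate` expects a single string, while `en["tweet"]` is a pandas Series, which is the probable cause of this `TypeError`. A minimal sketch of one fix is to join all tweets into one string first. A plain list stands in for the column here, and the `WordCloud` call is left as a comment because it was not rerun:

```python
# Stand-in for en["tweet"]; with pandas, " ".join(en["tweet"].astype(str)) does the same
tweets = [
    "@neilsiegel @Tesla Coming very soon",
    "@Kreative Vastly better maps/nav coming soon",
]

# One long string containing every tweet, separated by spaces
all_text = " ".join(tweets)
print(type(all_text).__name__, len(all_text))

# The joined text could then be passed to the word cloud:
# wordcloud = WordCloud(stopwords=STOPWORDS).generate(all_text)
```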

# 1. TF
1. Compute the term frequencies for each year. They should be **normalized** (scale of [0, 1]). **Exclude stopwords.**

In [102]:
#df.isna()

In [103]:
# shape(df)
# df = df[["date","tweet"]]
# shape(df)

In [104]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/audrey/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True
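
One possible way to compute the normalized term frequencies is sketched below. It assumes that "normalized" means dividing each word count by the total number of non-stopword tokens in that year, so every value lies in [0, 1] and the values of one year sum to 1. The tiny `STOP` set, the helper names and the sample tweets are illustrative stand-ins, not the notebook's data:

```python
import re
from collections import Counter

# Tiny illustrative stopword set; in the notebook this would be
# set(stopwords.words("english")) from NLTK.
STOP = {"the", "is", "a", "to", "of", "and", "in", "it", "for", "on"}

def tokenize(text):
    """Lowercase, drop URLs and @mentions, keep alphabetic tokens that are not stopwords."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return [w for w in re.findall(r"[a-z]+(?:'[a-z]+)?", text) if w not in STOP]

def term_frequencies(tweets):
    """Relative frequency of each word in one document (all tweets of a year)."""
    counts = Counter(w for t in tweets for w in tokenize(t))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Made-up tweets grouped by year, standing in for a groupby on df["date"].dt.year
tweets_by_year = {
    2017: ["The rocket is ready", "Rocket launch soon https://example.com"],
    2018: ["@someone Tesla production is ramping", "Tesla Tesla Tesla"],
}
tf_by_year = {year: term_frequencies(t) for year, t in tweets_by_year.items()}
print(tf_by_year[2017])
```

Dividing by the largest count instead of the total would also give values in [0, 1]; which normalization the assignment intends is not stated.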

# 2. Top 10 words
2. Show the top 10 words (for each year) by highest value of word frequency. 
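
Given a dictionary of normalized frequencies per year, the top words can be read off with `collections.Counter.most_common`. This is a small sketch with made-up frequencies; the helper name `top_words` is hypothetical:

```python
from collections import Counter

def top_words(tf, n=10):
    """Return the n highest-frequency (word, frequency) pairs, largest first."""
    return Counter(tf).most_common(n)

# Made-up normalized frequencies for two years
tf_by_year = {
    2017: {"rocket": 0.4, "launch": 0.2, "soon": 0.2, "ready": 0.2},
    2018: {"tesla": 0.5, "production": 0.3, "ramping": 0.2},
}
for year, tf in tf_by_year.items():
    print(year, top_words(tf, n=2))
```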

# 3. Histogram
3. Plot a histogram of word frequencies for each year 
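
Word frequencies are usually heavily skewed (a few frequent words, many rare ones), so a logarithmic count axis may make the histogram easier to read. The sketch below is one possible layout, not a definitive one: `histogram_counts` shows what the bins contain, and `plot_tf_histograms` (a hypothetical helper that needs matplotlib) draws one panel per year from a `{year: {word: frequency}}` dictionary.

```python
def histogram_counts(values, n_bins=5):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes into the last bin
        counts[idx] += 1
    return counts

def plot_tf_histograms(tf_by_year, bins=30):
    """Draw one histogram panel per year; matplotlib is imported only when called."""
    import matplotlib.pyplot as plt
    n = len(tf_by_year)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 3), squeeze=False)
    for ax, (year, tf) in zip(axes[0], sorted(tf_by_year.items())):
        ax.hist(list(tf.values()), bins=bins, log=True)  # log y-axis: most words are rare
        ax.set_title(str(year))
        ax.set_xlabel("term frequency")
    axes[0][0].set_ylabel("number of words")
    fig.tight_layout()
    return fig

print(histogram_counts([0.0, 0.1, 0.9, 1.0], n_bins=2))
```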

# 4. Zipf's law
4. Demonstrate Zipf’s law by plotting log-log plots of word frequencies v. rank for each year 
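
Zipf's law states that frequency is roughly proportional to 1/rank, so a log-log plot of frequency against rank should be close to a straight line with a slope near -1. The following sketch builds the rank-frequency pairs and estimates that slope by least squares. The frequencies are synthetic and follow the law exactly, and the helper names are hypothetical:

```python
import math

def rank_frequency(tf):
    """Sort frequencies in descending order and pair them with ranks 1, 2, 3, ..."""
    freqs = sorted(tf.values(), reverse=True)
    return list(range(1, len(freqs) + 1)), freqs

def loglog_slope(ranks, freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic frequencies that follow f = C / rank exactly
tf = {f"w{r}": 0.5 / r for r in range(1, 51)}
ranks, freqs = rank_frequency(tf)
print(round(loglog_slope(ranks, freqs), 3))

# With matplotlib, the same data can be drawn on log-log axes:
# plt.loglog(ranks, freqs, marker=".", linestyle="none")
```

On real tweets the fitted slope will only approximate -1, and the head and tail of the curve often deviate from the line.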

# 5. TF-IDF
5. Use TF-IDF to calculate and show the 5 most “important” words for each year 
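
A minimal TF-IDF sketch, treating each year as one document. It assumes the plain definition tf-idf(w, d) = tf(w, d) * ln(N / df(w)), where N is the number of documents and df(w) the number of documents containing w. Library implementations such as scikit-learn use smoothed variants, so their scores would differ. The token lists are made up and the helper names are hypothetical:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a label (year) to its list of tokens; returns label -> {word: score}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(w for tokens in docs.values() for w in set(tokens))
    scores = {}
    for label, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[label] = {
            w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()
        }
    return scores

def top_important(scores, n=5):
    """The n highest-scoring words per document."""
    return {label: Counter(s).most_common(n) for label, s in scores.items()}

docs = {
    2017: ["rocket", "launch", "rocket", "tesla"],
    2018: ["tesla", "production", "tesla", "model"],
    2019: ["starship", "tesla", "starship", "engine"],
}
scores = tf_idf(docs)
top = top_important(scores, n=2)
print(top)
```

With this definition, a word that occurs in every year (here `tesla`) gets a score of 0, which is why TF-IDF highlights words that are distinctive for a single year rather than words that are frequent overall.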