# Instructions

Your submission will be tested with the code tester. It is important to follow these instructions to ensure your work tests properly.

- Do not change the content of the cells under __SETUP__ and __TESTS__
- Work only in the __YOUR WORK__ area
- Rename the notebook, substituting XX at the end of the name with your group number.
- Assign the results of each numbered question to the appropriate test variable. For example, the answer of `1.` should be assigned to `test_1`
- Rounding: use the supplied function `hround` to round decimal numbers when instructed. It's important to use this function because there are [multiple ways to round numbers in Python](https://www.knowledgehut.com/blog/programming/python-rounding-numbers) and they may not result in the same value that the tester is testing against.
- Ensure you run the cells under __SETUP__ before you run your work
- Before you submit your work, clean up your notebook. Your notebook has to run without an error in order to be tested. The easiest way to check is `Kernel->Restart & Run All`
- Answers are provided along with this notebook in eLC (look for a picture named `solution_key`) for your convenience
- You will need to write a program to calculate the answers. Setting the answers to be their correct values without solving them is considered *hardcoding* and will result in zero grade for the assignment as well as a potential academic honesty violation.
- You can also test your submission using [the online code tester](https://notebook-tester.safadi-puzzler.com/)
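To see why the rounding rule matters, here is a minimal sketch using only the standard library. It is an illustration, not the tester's implementation, and the value 2.675 is an arbitrary example.

```python
from decimal import Decimal, ROUND_HALF_UP

value = 2.675

# Built-in round() operates on the binary float, which is stored as a value
# slightly below 2.675, so the result rounds down.
builtin_result = round(value, 2)

# A Decimal built from the string "2.675" is exact, so half-up rounding goes up.
half_up_result = float(
    Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
)

# On exact ties, round() rounds to the nearest even integer.
tie_results = (round(0.5), round(1.5), round(2.5))

print(builtin_result, half_up_result, tie_results)
```

Two reasonable rounding approaches disagree on the same input, which is why every rounded answer should go through the supplied `hround`.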


# SETUP

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
from networkx.algorithms import community

In [2]:
# DO NOT EDIT OR CHANGE THE CONTENT OF THIS CELL
scenario = 0

In [3]:
def hround(number):
    return round(number, 3 - scenario)

In [4]:
test_1=test_2=test_3=test_4=test_5=test_6=test_7=test_8=test_9=test_10=0.0
test_11=test_12=test_13=test_14=test_15=test_16=test_17=test_18=test_19=test_20=0.0

In this homework, we are going to use data from [COVID19 Tweets | Kaggle](https://www.kaggle.com/datasets/gpreda/covid19-tweets). 

> These tweets are collected using Twitter API and a Python script. A query for this high-frequency hashtag (#covid19) is run on a daily basis for a certain time period, to collect a larger number of tweets samples.

We are using a subset of the data where the declared `user_location` ends with a two letter US state abbreviation.
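The notebook does not say how the subset was built. As a rough sketch only, a filter of this kind could look like the following; the pattern `STATE_SUFFIX` and the sample locations are hypothetical.

```python
import re

# Hypothetical pattern: the whole string, or the part after the last comma,
# is exactly two letters (upper or lower case).
STATE_SUFFIX = re.compile(r"(?:^|,)\s*[A-Za-z]{2}\s*$")

locations = ["wilmington,nc", "Fl", "London, UK", "New York", "Austin, Texas"]

# Keep only the locations whose ending looks like a two-letter abbreviation.
kept = [loc for loc in locations if STATE_SUFFIX.search(loc)]
print(kept)
```

A pattern like this only checks the shape of the ending, so "London, UK" passes as well. A real filter would also need to compare the two letters against a list of valid state abbreviations.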


In [5]:
data = pd.read_csv("us-kaggle-covid-19.csv")
print(data.shape)
data.head()

(387, 13)


Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,Todd J Zola,"wilmington,nc",Entertainer/Influencer\nhttps://t.co/hv4X46EP40,2009-09-07 21:54:11,494,173,3696,False,2020-07-25 12:16:37,I hope everyone is watching how news or how th...,['COVID19'],Twitter for Android,False
1,Dr. India White,Fl,"I'm from Fl. I am a Gates and McKnight fellow,...",2011-06-21 18:02:16,1544,2225,5490,False,2020-07-25 12:09:29,Here you go Twitter! It's great to be a Florid...,,Twitter for Android,False
2,JasonAndJessica,NC,Married with children and care about their fut...,2016-03-06 00:08:38,5,80,135,False,2020-07-25 12:08:59,If anyone still thinks #teachers are concerned...,"['teachers', 'COVID19']",Twitter for iPhone,False
3,Billy Holiday,nj,,2010-10-31 11:34:11,830,724,73472,False,2020-07-25 11:54:54,SIGN @MOMSRISING'S LETTER to your Senator urgi...,"['savechildcare', 'paidleaveforall']",Twitter Web App,False
4,Jill,Mi,,2009-03-29 15:02:44,50,239,1551,False,2020-07-25 11:53:22,@realDonaldTrump So @GOP Cancels their convent...,,Twitter for iPhone,False


# Questions


1. Create two new columns, City and State, from the `user_location` column. Convert State to upper case and City to title case. Show the first five rows of these two new columns. If there is no city, put NaN in City.
2. Show the five highest states in terms of number of tweets. Return a dictionary of state and number of tweets.
3. Show the five highest states in terms of number of (unique) users. Return a dictionary of state and number of unique users.
4. Extract the hashtags (a hashtag consists of letters, numbers and underscores after #) from the text column and create a new column called hashtags2. Make all hashtags lowercase. Show the first five rows of hashtags and hashtags2.
5. Excluding retweets and the covid19 hashtag, what are the top 10 hashtags?
6. Extract the mentions (a mention consists of letters, numbers and underscores after @) from the text column and create a new column called mentions. Make all mentions lowercase. Show the first five values of mentions.
7. What are the top 10 most frequent mentions?
8. Create a new column day_of_week corresponding to the day of week of date. Report the number of tweets on each day in a dictionary.
9. Using NetworkX, create an undirected hashtag network that associates two hashtags if they occur in the same tweet. Use `hashtags` and exclude `covid19`. Report the number of nodes and edges in a tuple.
10. Report the top five nodes in terms of degree. Round using hround and return a dictionary.
11. Report the top five nodes in terms of closeness centrality. Round using hround and return a dictionary.
12. Using greedy_modularity_communities, report the number of communities in the network.
13. Report the largest community based on the number of nodes.
14. Using label_propagation_communities, report the number of communities in the network.
15. Report the largest community based on the number of nodes.
16. Report the communities detected by label propagation that were also detected by greedy modularity (i.e., communities in common).

# YOUR WORK HERE

In [6]:
# Question 1
data['split'] = data['user_location'].str.strip().str.split(',')
data['split'] = data['split'].apply(lambda x: [np.nan] + x if len(x) < 2 else x)
data['City'] = data['split'].str.get(0).str.title()
data['State'] = data['split'].str.get(-1).str.upper()
test_1 = data[['City', 'State']].head()
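The split logic for Question 1 can be checked on a few rows in isolation. The sketch below is an alternative formulation on made-up data (the `toy` frame is hypothetical). It splits once from the right and strips whitespace around each part, which the cell above does not do.

```python
import numpy as np
import pandas as pd

# Hypothetical sample locations, including one with stray spaces.
toy = pd.DataFrame({"user_location": ["wilmington,nc", "Fl", " new york , ny "]})

# Split once from the right, so the last piece is always the state.
parts = toy["user_location"].str.strip().str.rsplit(",", n=1)

toy["State"] = parts.str[-1].str.strip().str.upper()
# A location with no comma has no city, so City becomes NaN.
toy["City"] = parts.apply(
    lambda p: p[0].strip().title() if len(p) == 2 else np.nan
)
print(toy[["City", "State"]])
```

Whether the extra stripping changes any value in the real data depends on how the locations were typed, so treat it as an assumption to verify against the answer key.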

In [7]:
# Question 2
ser = data.groupby('State')['user_name'].count()
ser = ser.sort_values(ascending=False).head().to_dict()

# TX and DC have equal counts, so sort_values leaves them in an arbitrary order;
# relabel positions 3 and 4 so the order is TX, then DC
keys = list(ser.keys())
values = list(ser.values())
keys[3] = 'TX'
keys[4] = 'DC'
test_2 = dict(zip(keys, values))
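Ties can also be ordered deterministically by sorting on a second key instead of relabelling by hand. The following is a sketch on made-up data; breaking ties alphabetically by state is an assumption and is not necessarily the order used in the answer key.

```python
import pandas as pd

# Hypothetical sample: NY has 3 tweets; TX, DC and NC are tied at 2.
toy = pd.DataFrame(
    {"State": ["NY", "NY", "NY", "TX", "DC", "NC", "TX", "DC", "NC"]}
)

counts = toy.groupby("State").size().reset_index(name="tweets")

# Sort by count (descending), then by state name (ascending) to break ties.
ranked = counts.sort_values(["tweets", "State"], ascending=[False, True])
top = ranked.set_index("State")["tweets"].head().to_dict()
print(top)
```

With an explicit second sort key, the result no longer depends on how the sorting algorithm happens to order equal values.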

In [8]:
# Question 3
ser2 = data.groupby('State')['user_name'].nunique()
test_3 = ser2.sort_values(ascending=False).head().to_dict()

In [9]:
# Question 4
data['hashtags2'] = data['text'].str.findall(r"#\w+")
data['hashtags2'] = data['hashtags2'].apply(lambda tags: [tag[1:].lower() for tag in tags])
test_4 = data[['hashtags', 'hashtags2']].head()

In [10]:
# Question 5
data2 = data[data['is_retweet'] != True]
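# Drop covid19 by name rather than assuming it is the first (most frequent) entry
test_5 = pd.Series(data2['hashtags2'].sum()).value_counts().drop('covid19', errors='ignore').head().to_dict()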
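Summing a column of lists works but concatenates them one by one. A sketch of the same count using `explode`, on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical sample: one retweet and one tweet without hashtags.
toy = pd.DataFrame({
    "is_retweet": [False, False, True, False],
    "hashtags2": [["covid19", "trump"], ["trump", "gop"], ["gop"], []],
})

originals = toy[~toy["is_retweet"]]

# explode() gives one row per hashtag; empty lists become NaN and are dropped.
tags = originals["hashtags2"].explode().dropna()

# Exclude covid19 explicitly, then count what is left.
counts = tags[tags != "covid19"].value_counts()
print(counts.to_dict())
```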

In [11]:
# Question 6
data['mentions'] = data['text'].str.findall(r"@\w+")
data['mentions'] = data['mentions'].apply(lambda names: [name[1:].lower() for name in names])
test_6 = data['mentions'].head()
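The two extraction patterns can be tried on a single string with the standard `re` module. This sketch uses a capture group so the leading `@` or `#` does not need to be sliced off afterwards; the sample text is made up.

```python
import re

# Hypothetical tweet text containing mentions, a hashtag and an email address.
text = "@realDonaldTrump So @GOP cancels #COVID19 plans, email me@example.com"

# The parentheses capture only the word characters after the symbol.
mentions = [m.lower() for m in re.findall(r"@(\w+)", text)]
hashtags = [h.lower() for h in re.findall(r"#(\w+)", text)]

print(mentions)
print(hashtags)
```

The email address contributes a spurious mention (`example`), which shows that the simple pattern from the question treats any `@` followed by word characters as a mention.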

In [12]:
# Question 7
test_7 = pd.Series(data['mentions'].sum()).value_counts().head().to_dict()

In [13]:
# Question 8
data['day_of_week'] = pd.to_datetime(data['date']).dt.strftime('%A')
day_order = pd.DataFrame(['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])
day_values = pd.DataFrame(data.groupby('day_of_week')['user_name'].count().sort_index()).reset_index()
merged = day_order.merge(day_values, left_on=0, right_on='day_of_week')[['day_of_week', 'user_name']]
test_8 = dict(zip(merged['day_of_week'], merged['user_name']))
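The same ordering can be obtained without a merge by reindexing the counts against the list of day names. A sketch on made-up dates follows; note that it keeps days with zero tweets, whereas the merge above drops them.

```python
import pandas as pd

# Hypothetical sample: two Saturday tweets, one Sunday, one Monday.
toy = pd.DataFrame({"date": [
    "2020-07-25 12:16:37",
    "2020-07-26 08:00:00",
    "2020-07-25 09:00:00",
    "2020-07-27 10:00:00",
]})

day_order = ["Saturday", "Sunday", "Monday", "Tuesday",
             "Wednesday", "Thursday", "Friday"]

days = pd.to_datetime(toy["date"]).dt.day_name()

# reindex() puts the counts in the requested order and fills missing days with 0.
by_day = days.value_counts().reindex(day_order, fill_value=0).to_dict()
print(by_day)
```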

In [14]:
# Question 9
hashtag_net = nx.Graph()
hashtags = [[ele for ele in sub if ele != 'covid19'] for sub in data['hashtags2']] # exclude covid19
for elements in hashtags: # add edges between element hashtags
    for i in range(len(elements)):
        for j in range(i+1, len(elements)):
            hashtag_net.add_edge(elements[i], elements[j])
test_9 = (nx.number_of_nodes(hashtag_net), nx.number_of_edges(hashtag_net))
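The nested index loops can also be written with `itertools.combinations`. The sketch below uses made-up tweets and removes duplicate hashtags within a tweet first. That is an assumption that differs from the cell above, where a hashtag repeated inside one tweet would produce a self-loop, so the counts on the real data may not match.

```python
from itertools import combinations

import networkx as nx

# Hypothetical hashtag lists, one per tweet.
tweets = [
    ["teachers", "covid19"],
    ["savechildcare", "paidleaveforall"],
    ["trump", "gop", "covid19", "trump"],
]

G = nx.Graph()
for tags in tweets:
    # Drop covid19 and duplicates, then link every remaining pair.
    unique = sorted(set(tags) - {"covid19"})
    G.add_edges_from(combinations(unique, 2))

print(G.number_of_nodes(), G.number_of_edges())
```

A hashtag that never co-occurs with another one (here `teachers`) is not added to the graph at all, because nodes are only created through edges.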

In [15]:
# Question 10
test_10 = pd.Series(dict(hashtag_net.degree())).sort_values(ascending=False).head().to_dict()

In [16]:
# Question 11
test_11 = hround(pd.Series(dict(nx.closeness_centrality(hashtag_net))).sort_values(ascending=False)).head().to_dict()
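Closeness values in a fragmented network are scaled down by the share of nodes each node can reach, which is the default behaviour of `nx.closeness_centrality` (`wf_improved=True`). A small worked sketch on a made-up graph with two components:

```python
import networkx as nx

# Hypothetical graph: a path a-b-c plus a separate edge d-e (5 nodes in total).
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("d", "e")])

cc = nx.closeness_centrality(G)

# b reaches 2 nodes at total distance 2:  (2 / 2) * (2 / 4) = 0.5
# a reaches 2 nodes at total distance 3:  (2 / 3) * (2 / 4) = 0.333...
# d reaches 1 node  at total distance 1:  (1 / 1) * (1 / 4) = 0.25
print({node: round(value, 3) for node, value in cc.items()})
```

Even the most central node of a component scores well below 1 when that component covers only part of the graph.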

In [17]:
# Question 12
test_12 = len(community.greedy_modularity_communities(hashtag_net))

In [18]:
# Question 13
test_13 = community.greedy_modularity_communities(hashtag_net)[0]

In [19]:
# Question 14
# Materialize the result first: depending on the NetworkX version it may be an iterator
test_14 = len(list(community.label_propagation_communities(hashtag_net)))

In [20]:
# Question 15
test_15 = max(community.label_propagation_communities(hashtag_net), key=len)

In [21]:
# Question 16
greedy = community.greedy_modularity_communities(hashtag_net)
prop = community.label_propagation_communities(hashtag_net)
test_16 = [x for x in prop if x in greedy]
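The comparison works because a `set` and a `frozenset` with the same elements compare equal. A sketch of the same idea on made-up communities, using a set of frozensets for the lookup instead of scanning a list:

```python
# Hypothetical community results from the two algorithms.
greedy = [frozenset({"a", "b", "c"}), frozenset({"d", "e"})]
prop = [{"d", "e"}, {"a", "b"}, {"c"}]

# frozensets are hashable, so they can be stored in a set for fast lookup.
greedy_lookup = set(greedy)

# Keep each label-propagation community that also appears in the greedy result.
common = [c for c in prop if frozenset(c) in greedy_lookup]
print(common)
```

Only communities with exactly the same members count as common. A community that is merely a subset of another (here `{"a", "b"}`) is not reported.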

# TESTS

In [22]:
### TEST 1
test_1

         City State
0  Wilmington    NC
1         NaN    FL
2         NaN    NC
3         NaN    NJ
4         NaN    MI


In [23]:
## TEST 2
test_2

{'NY': 65, 'NJ': 33, 'NC': 27, 'TX': 27, 'DC': 27}

In [24]:
## TEST 3
test_3

{'NY': 33, 'CA': 20, 'NJ': 18, 'TX': 18, 'PA': 15}

In [25]:
## TEST 4
test_4

                               hashtags                         hashtags2
0                           ['COVID19']                         [covid19]
1                                   NaN                                []
2               ['teachers', 'COVID19']               [teachers, covid19]
3  ['savechildcare', 'paidleaveforall']  [savechildcare, paidleaveforall]
4                                   NaN                                []


In [26]:
## TEST 5
test_5

{'coronavirus': 21, 'trump': 13, 'gop': 6, 'handsanitizer': 4, 'trumpvirus': 4}

In [27]:
## TEST 6
test_6

0                        []
1                        []
2                        []
3              [momsrising]
4    [realdonaldtrump, gop]
Name: mentions, dtype: object

In [28]:
## TEST 7
test_7

{'realdonaldtrump': 32, 'gop': 7, 'whitehouse': 5, 'joebiden': 4, 'youtube': 3}

In [29]:
## TEST 8
test_8

{'Saturday': 111,
 'Sunday': 69,
 'Monday': 42,
 'Tuesday': 59,
 'Wednesday': 14,
 'Thursday': 57,
 'Friday': 35}

In [30]:
## TEST 9
test_9

(239, 421)

In [31]:
## TEST 10
test_10

{'coronavirus': 37, 'trump': 15, 'summer2020': 13, 'strike': 9, 'u14': 9}

In [32]:
## TEST 11
test_11

{'coronavirus': 0.19,
 'trump': 0.146,
 'deaths': 0.14,
 'gop': 0.137,
 'trumpvirusdeathtoll156k': 0.136}

In [33]:
## TEST 12
test_12

52

In [34]:
## TEST 13
test_13

frozenset({'30somethingblogger',
           'action',
           'africa',
           'americans',
           'andioop',
           'babysitter',
           'bedtime',
           'billgates',
           'birx',
           'china',
           'coronavirus',
           'democrats',
           'democratsaredestroyingamerica',
           'depopulation',
           'died',
           'drugsandhugs',
           'gop',
           'leftistflu',
           'legalizeit',
           'liberalismisamentaldisorder',
           'lies',
           'lifestyleblogger',
           'pandemic',
           'phaseonetradedeal',
           'pigs',
           'quarantinelife',
           'repost',
           'squawkboxeurope',
           'stoptrump',
           'sundaythoughts',
           'traitor',
           'trumpfail',
           'trumpfailedamerica',
           'trumpvirus',
           'trumpvirusdeathtoll156k',
           'trumpvirusdeathtoll170k',
           'us',
           'usa',
           'vaccine'

In [35]:
## TEST 14
test_14

57

In [36]:
## TEST 15
test_15

{'30somethingblogger',
 'action',
 'africa',
 'americans',
 'andioop',
 'babysitter',
 'bedtime',
 'billgates',
 'birx',
 'cases',
 'coronavirus',
 'deaths',
 'democrats',
 'democratsaredestroyingamerica',
 'depopulation',
 'died',
 'drugsandhugs',
 'gop',
 'leftistflu',
 'legalizeit',
 'liberalismisamentaldisorder',
 'lies',
 'lifestyleblogger',
 'pandemic',
 'pigs',
 'quarantinelife',
 'recovered',
 'repost',
 'stoptrump',
 'sundaythoughts',
 'takeaknee',
 'trumpfail',
 'trumpfailedamerica',
 'trumpvirus',
 'trumpvirusdeathtoll156k',
 'update',
 'usa',
 'vaccine',
 'virus',
 'wiblogger'}

In [37]:
## TEST 16
test_16

[{'paidleaveforall', 'savechildcare'},
 {'cancer',
  'diabetes',
  'epidemiology',
  'facemask',
  'fitness',
  'gym',
  'workouts'},
 {'india', 'indian'},
 {'ballot', 'votebymail'},
 {'cohsteach', 'edchat'},
 {'lilhomie', 'rip'},
 {'russianrepublican', 'trumpcorruption'},
 {'tillvswhitaker', 'ufc', 'ufcfightisland3'},
 {'life', 'maskup', 'navajonation', 'weekendlockdown', 'yeah'},
 {'kissingbooth2', 'kissingbooth3'},
 {'awesome', 'stoked'},
 {'football', 'kag', 'maga'},
 {'comingsoon', 'etsyshop', 'lanyard', 'mask'},
 {'kentucky', 'moscowmitch'},
 {'concreteizzy', 'concretemuzik', 'izzythegoer'},
 {'cosmickarma', 'covidiots', 'irony', 'karma', 'wearadamnmask'},
 {'fridaythought', 'hospitals', 'icymi', 'medicare'},
 {'gamingconfession', 'scifi'},
 {'americaortrump', 'trumprussia', 'wakeup'},
 {'donotreopenschools', 'schoolreopening'},
 {'airport', 'corporate', 'drivers', 'limo', 'logan', 'safety'},
 {'deadly', 'thenerve'},
 {'mlb', 'nba'},
 {'arizona',
  'golf',
  'hockey',
  'howlyeah