# Plan: Data Cleaning / Feature Extraction

Tweet-level features : 
    - delete null rows
    - tweet length / num of words for each tweet / number of sentences / average length of words
    - vader sentiment (neg/neu/pos/compound)
    - number of likes

City-level features :
    - num of tweets
    - avg tweet length / avg num words per tweet/ average number of sentences / average word length / average sentiment scores (neg/neu/pos/compound) 
    - num of words in the concatenated tweets
    - weight of thematic BoW
    - extract features from counts of stopwords, punctuation ...
    - convert emojis and emoticons into words
    - NLP tasks (expand contractions; remove digits, @mentions, URLs and punctuation; tokenize; remove stopwords and useless tokens; lemmatize; count clean tokens)
    - part of speech tagging on tokens
    - most frequent 100 tokens extraction
    - sentiment analysis vader
    - sentiment polarity and subjectivity with textblob
    - countvectorizer with scikit learn
    - tf-idf with scikit learn 

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import re
import string

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

# algorithms
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
smartcities100 = pd.read_csv('/Users/juliencarbonnell/Desktop/Thèse/DONNÉES/1.Twitter/smartcities100/smartcities100 tweets.csv')

In [3]:
smartcities100.head()

Unnamed: 0,tweetDate,content,twitterProfile,tweetUrl,timestamp,query,rank2020
0,Sun Jan 17 10:57:51 +0000 2021,A Delegation from @aau_ae visited @BurjeelMed...,https://twitter.com/Atatreh,https://twitter.com/Atatreh/status/13507592313...,2021-01-21T13:59:48.147Z,#AbuDhabi,42
1,Thu Jan 14 17:40:03 +0000 2021,Wizz Air #AbuDhabi is set to launch flights to...,https://twitter.com/UAE_Forsan,https://twitter.com/UAE_Forsan/status/13497732...,2021-01-21T13:59:48.147Z,#AbuDhabi,42
2,Wed Jan 20 20:25:02 +0000 2021,What a great grappling exchange by both man #A...,https://twitter.com/RdosAnjosMMA,https://twitter.com/RdosAnjosMMA/status/135198...,2021-01-21T13:59:48.147Z,#AbuDhabi,42
3,Sun Jan 17 16:55:17 +0000 2021,Totally worth the 3.5 hour drive to #AbuDhabi ...,https://twitter.com/zoomnclick,https://twitter.com/zoomnclick/status/13508491...,2021-01-21T13:59:48.147Z,#AbuDhabi,42
4,Mon Jan 18 04:44:44 +0000 2021,Good morning #AbuDhabi Ireland 🇮🇪 vs UAE 🇦🇪 fi...,https://twitter.com/ChTahirmehmood,https://twitter.com/ChTahirmehmood/status/1351...,2021-01-21T13:59:48.147Z,#AbuDhabi,42


In [4]:
# how many smartcities are there in my dataset ?
smartcities100['query'].nunique()

106

I will append my three case studies (Taipei, Tallinn and Tel Aviv), which are stored in dedicated files.

In [5]:
taipei = pd.read_csv('/Users/juliencarbonnell/Desktop/Thèse/DONNÉES/1.Twitter/Taipei/#Taipei.csv')
telaviv = pd.read_csv('/Users/juliencarbonnell/Desktop/Thèse/DONNÉES/1.Twitter/Tel Aviv/#Telaviv.csv')
tallinn = pd.read_csv('/Users/juliencarbonnell/Desktop/Thèse/DONNÉES/1.Twitter/Tallinn/#Tallinn.csv')

The rank2020 column is missing from these three files. I will add it before merging the files into the same dataframe.

In [6]:
# store the ranks as integers, like the rank2020 values of the main file
taipei['rank2020'] = 8
telaviv['rank2020'] = 50
tallinn['rank2020'] = 59

In [7]:
# ignore_index avoids duplicated row labels after stacking the four files
smartcities100 = pd.concat([smartcities100, taipei, telaviv, tallinn], axis=0, ignore_index=True)

In [8]:
smartcities100.shape

(110862, 7)

In [9]:
smartcities100['query'].nunique()

109

In [10]:
smartcities100['query'].value_counts()

#telaviv         17612
#Taipei          17337
#Tallinn          6295
#Brisbane         2080
#Manchester       1877
#Bilbao           1338
#Zaragoza         1290
#Montreal         1231
#Vancouver        1187
#SanFrancisco     1022
#Bologna          1000
#Rome             1000
#Athens           1000
#Bengaluru        1000
#Oslo             1000
#Berlin           1000
#Toronto          1000
#Mumbai           1000
#Amsterdam        1000
#Bogota           1000
#Bangkok          1000
#Munich           1000
#HongKong          999
#Helsinki          998
#Lagos             991
#Dublin            974
#Newcastle         973
#Barcelona         956
#Hyderabad         946
#Santiago          939
#BuenosAires       932
#Nairobi           918
#SaoPaulo          913
#Chicago           907
#Madrid            903
#Moscow            898
#CapeTown          886
#Boston            868
#Melbourne         866
#London            858
#LosAngeles        856
#Medellin          850
#AbuDhabi          832
#Abuja     

In [11]:
smartcities100.head()

Unnamed: 0,tweetDate,content,twitterProfile,tweetUrl,timestamp,query,rank2020
0,Sun Jan 17 10:57:51 +0000 2021,A Delegation from @aau_ae visited @BurjeelMed...,https://twitter.com/Atatreh,https://twitter.com/Atatreh/status/13507592313...,2021-01-21T13:59:48.147Z,#AbuDhabi,42
1,Thu Jan 14 17:40:03 +0000 2021,Wizz Air #AbuDhabi is set to launch flights to...,https://twitter.com/UAE_Forsan,https://twitter.com/UAE_Forsan/status/13497732...,2021-01-21T13:59:48.147Z,#AbuDhabi,42
2,Wed Jan 20 20:25:02 +0000 2021,What a great grappling exchange by both man #A...,https://twitter.com/RdosAnjosMMA,https://twitter.com/RdosAnjosMMA/status/135198...,2021-01-21T13:59:48.147Z,#AbuDhabi,42
3,Sun Jan 17 16:55:17 +0000 2021,Totally worth the 3.5 hour drive to #AbuDhabi ...,https://twitter.com/zoomnclick,https://twitter.com/zoomnclick/status/13508491...,2021-01-21T13:59:48.147Z,#AbuDhabi,42
4,Mon Jan 18 04:44:44 +0000 2021,Good morning #AbuDhabi Ireland 🇮🇪 vs UAE 🇦🇪 fi...,https://twitter.com/ChTahirmehmood,https://twitter.com/ChTahirmehmood/status/1351...,2021-01-21T13:59:48.147Z,#AbuDhabi,42


# 1. Extract tweet-level features
- tweet length for each tweet
- number of words for each tweet
- number of sentences for each tweet
- average word length for each tweet

In [12]:
# how many null values in content column ?
smartcities100['content'].isna().sum()

12

In [13]:
# keep only rows when content column value is not null
smartcities100 = smartcities100[smartcities100['content'].notnull()]

In [14]:
smartcities100['content'].isna().sum()

0

In [15]:
# add a column with tweet length for each tweet
smartcities100['tweet_len'] = smartcities100['content'].apply(lambda x: len(x))

In [16]:
# add a column with number of words for each tweet
smartcities100['num_words'] = smartcities100['content'].apply(lambda x: len(x.split()))

In [17]:
# add a column with number of stopwords for each tweet
stop = set(stopwords.words('english'))  # a set makes membership tests fast
smartcities100['stopwords'] = smartcities100['content'].apply(lambda x: len([word for word in x.split() if word in stop]))

In [18]:
# add a column with an approximate number of sentences for each tweet:
# the text is split wherever "." (or "~") is followed by a space and three more characters
smartcities100['num_sentences'] = smartcities100['content'].apply(lambda x: len(re.split(r'[.~] ...', x)))
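
The sentence count above is a heuristic based on a period followed by a space. A more explicit variant is sketched below, assuming a sentence ends with ".", "!" or "?" followed by whitespace. `count_sentences` is a hypothetical helper that is not part of this notebook, and its counts would differ from the `num_sentences` column.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Assumption: a sentence ends with ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def count_sentences(text):
    """Count sentences by splitting after terminal punctuation."""
    parts = [part for part in _SENTENCE_END.split(text.strip()) if part]
    return len(parts)

example = "Totally worth the drive. Great views! Would you go again?"
print(count_sentences(example))
```

It could be applied with `smartcities100['content'].apply(count_sentences)`. Decimal numbers such as "3.5" are not split because no whitespace follows the period.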

In [19]:
# create function for finding average length of words per tweet
def get_avg_word_len(x):
    words = x.split()
    # guard against tweets that contain only whitespace
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

# add a column with average word length for each tweet
smartcities100['avg_word_len'] = smartcities100['content'].apply(get_avg_word_len)

In [20]:
# create a function to find the number of punctuation characters for each tweet
def count_punct(text):
    return sum(1 for char in text if char in string.punctuation)

# add a column with number of punctuation characters for each tweet
smartcities100['punctuation'] = smartcities100['content'].apply(count_punct)

In [21]:
# count number of hashtags per tweet
smartcities100['hashtags'] = smartcities100['content'].apply(lambda x: len([word for word in x.split() if word.startswith('#')]))

In [22]:
# count number of numerics per tweet
smartcities100['numerics'] = smartcities100['content'].apply(lambda x: len([word for word in x.split() if word.isdigit()]))

In [23]:
# count number of uppercase words
smartcities100['upper'] = smartcities100['content'].apply(lambda x: len([word for word in x.split() if word.isupper()]))

In [24]:
# use VADER to score each tweet once, then store the neg/neu/pos proportions and the compound score in new columns
analyzer = SentimentIntensityAnalyzer()

vader_scores = [analyzer.polarity_scores(v) for v in smartcities100['content']]
for key in ['neg', 'neu', 'pos', 'compound']:
    smartcities100[key] = [scores[key] for scores in vader_scores]

In [25]:
# sentiment polarity and subjectivity with textblob
smartcities100['polarity_tweet'] = smartcities100['content'].apply(lambda x: TextBlob(x).sentiment.polarity)
smartcities100['subjectivity_tweet'] = smartcities100['content'].apply(lambda x: TextBlob(x).sentiment.subjectivity)

#### find number of likes for each tweet

In [26]:
#check the columns of my dataset
smartcities100.head()

Unnamed: 0,tweetDate,content,twitterProfile,tweetUrl,timestamp,query,rank2020,tweet_len,num_words,stopwords,num_sentences,avg_word_len,punctuation,hashtags,numerics,upper,neg,neu,pos,compound,polarity_tweet,subjectivity_tweet
0,Sun Jan 17 10:57:51 +0000 2021,A Delegation from @aau_ae visited @BurjeelMed...,https://twitter.com/Atatreh,https://twitter.com/Atatreh/status/13507592313...,2021-01-21T13:59:48.147Z,#AbuDhabi,42,303,38,10,2,6.973684,14,2,0,2,0.104,0.849,0.047,-0.5267,0.4,0.6
1,Thu Jan 14 17:40:03 +0000 2021,Wizz Air #AbuDhabi is set to launch flights to...,https://twitter.com/UAE_Forsan,https://twitter.com/UAE_Forsan/status/13497732...,2021-01-21T13:59:48.147Z,#AbuDhabi,42,87,12,3,1,6.333333,8,2,0,0,0.0,1.0,0.0,0.0,0.0,0.0
2,Wed Jan 20 20:25:02 +0000 2021,What a great grappling exchange by both man #A...,https://twitter.com/RdosAnjosMMA,https://twitter.com/RdosAnjosMMA/status/135198...,2021-01-21T13:59:48.147Z,#AbuDhabi,42,53,9,3,1,5.0,1,1,0,0,0.0,0.661,0.339,0.6249,0.8,0.75
3,Sun Jan 17 16:55:17 +0000 2021,Totally worth the 3.5 hour drive to #AbuDhabi ...,https://twitter.com/zoomnclick,https://twitter.com/zoomnclick/status/13508491...,2021-01-21T13:59:48.147Z,#AbuDhabi,42,114,18,5,1,5.388889,10,2,1,1,0.0,0.748,0.252,0.659,0.3,0.1
4,Mon Jan 18 04:44:44 +0000 2021,Good morning #AbuDhabi Ireland 🇮🇪 vs UAE 🇦🇪 fi...,https://twitter.com/ChTahirmehmood,https://twitter.com/ChTahirmehmood/status/1351...,2021-01-21T13:59:48.147Z,#AbuDhabi,42,91,12,0,1,6.666667,7,2,0,1,0.0,0.791,0.209,0.4404,0.35,0.8


The dataset is now complete. As the tweet counts per city show, my three case studies (Tel Aviv, Taipei and Tallinn) have far larger samples than the other cities. This imbalance is not a problem, since I will work with means and standard deviations rather than totals. The larger samples make the estimates for the case studies more robust when they are compared with the whole population.
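
One way to see why a larger sample supports a more reliable mean is the standard error of the mean, which is the sample standard deviation divided by the square root of the sample size. The sketch below uses made-up scores, not values from the dataset, and `standard_error` is a hypothetical helper.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Made-up scores for illustration only; they are not taken from the dataset.
small_sample = [0.1, 0.5, -0.2, 0.4]
large_sample = small_sample * 25  # same spread, 100 observations

print(standard_error(small_sample))
print(standard_error(large_sample))
```

With the same spread of values, the larger sample yields a smaller standard error, so its mean is estimated more precisely.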

# 2. Extract city-level features 
- create a dataframe with city_id as index
- use groupby on the 'query' column 
- aggregate city-level features from the tweet-level dataframe 
- take the weight of smartcityBoWs for each city

## Create a new dataframe to store features 

In [27]:
# copy the unique values from 'query' column as index in the new dataframe
index = smartcities100['query'].unique()
columns = []

# create a new DataFrame
sc100_features = pd.DataFrame(index=index, columns=columns)

In [28]:
# agg rank2020 column
sc100_features = pd.concat([sc100_features, smartcities100.groupby('query').rank2020.agg(['min'])], axis=1)
sc100_features.head()

Unnamed: 0,min
#AbuDhabi,42
#Abuja,107
#Amsterdam,9
#Ankara,57
#Athens,99


In [29]:
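# agg num_tweets column (named 'query' explicitly, since the default name returned by value_counts() depends on the pandas version)
sc100_features = pd.concat([sc100_features, smartcities100['query'].value_counts().rename('query')], axis=1)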
sc100_features.head()

Unnamed: 0,min,query
#AbuDhabi,42,832.0
#Abuja,107,828.0
#Amsterdam,9,1000.0
#Ankara,57,313.0
#Athens,99,1000.0


In [30]:
# agg total tweet length by city
sc100_features['total_tweet_len'] = smartcities100.groupby('query')['tweet_len'].sum()
sc100_features.head()

Unnamed: 0,min,query,total_tweet_len
#AbuDhabi,42,832.0,161350.0
#Abuja,107,828.0,173587.0
#Amsterdam,9,1000.0,180311.0
#Ankara,57,313.0,58158.0
#Athens,99,1000.0,186752.0


In [31]:
# calculate average tweet length
sc100_features['avg_tweet_len'] = sc100_features['total_tweet_len']/sc100_features['query']

In [32]:
# agg total number of words 
sc100_features['total_num_words'] = smartcities100.groupby('query')['num_words'].sum()

In [33]:
# calculate average number of words
sc100_features['avg_num_words'] = sc100_features['total_num_words']/sc100_features['query']

In [34]:
# agg total number of stopwords 
sc100_features['total_stopwords'] = smartcities100.groupby('query')['stopwords'].sum()

In [35]:
# calculate average number of stopwords
sc100_features['avg_stopwords'] = sc100_features['total_stopwords']/sc100_features['query']

In [36]:
# agg total number of sentences 
sc100_features['total_num_sentences'] = smartcities100.groupby('query')['num_sentences'].sum()

In [37]:
# calculate average number of sentences
sc100_features['avg_num_sentences'] = sc100_features['total_num_sentences']/sc100_features['query']

In [38]:
# agg total average word length
sc100_features['total_avg_word_len'] = smartcities100.groupby('query')['avg_word_len'].sum()

In [39]:
# calculate average of average word length
sc100_features['avg_avg_word_len'] = sc100_features['total_avg_word_len']/sc100_features['query']

In [40]:
# agg total number of punctuation
sc100_features['total_punctuation'] = smartcities100.groupby('query')['punctuation'].sum()

In [41]:
# calculate average number of punctuation
sc100_features['avg_punctuation'] = sc100_features['total_punctuation']/sc100_features['query']

In [42]:
# agg total number of hashtags
sc100_features['total_hashtags'] = smartcities100.groupby('query')['hashtags'].sum()

In [43]:
# calculate average number of hashtags
sc100_features['avg_hashtags'] = sc100_features['total_hashtags']/sc100_features['query']

In [44]:
# agg total number of numerics
sc100_features['total_numerics'] = smartcities100.groupby('query')['numerics'].sum()

In [45]:
# calculate average number of numerics
sc100_features['avg_numerics'] = sc100_features['total_numerics']/sc100_features['query']

In [46]:
# agg total number of uppercase words
sc100_features['total_upper'] = smartcities100.groupby('query')['upper'].sum()

In [47]:
# calculate average number of uppercase words
sc100_features['avg_upper'] = sc100_features['total_upper']/sc100_features['query']

In [48]:
# agg total sentiment scores
sc100_features['neg'] = smartcities100.groupby('query')['neg'].sum()
sc100_features['neu'] = smartcities100.groupby('query')['neu'].sum()
sc100_features['pos'] = smartcities100.groupby('query')['pos'].sum()
sc100_features['compound'] = smartcities100.groupby('query')['compound'].sum()

In [49]:
# calculate average sentiment scores
sc100_features['avg_neg'] = sc100_features['neg']/sc100_features['query']
sc100_features['avg_neu'] = sc100_features['neu']/sc100_features['query']
sc100_features['avg_pos'] = sc100_features['pos']/sc100_features['query']
sc100_features['avg_compound'] = sc100_features['compound']/sc100_features['query']

In [50]:
# agg polarity and subjectivity scores
sc100_features['polarity'] = smartcities100.groupby('query')['polarity_tweet'].sum()
sc100_features['subjectivity'] = smartcities100.groupby('query')['subjectivity_tweet'].sum()

In [51]:
# calculate average polarity and subjectivity scores
sc100_features['avg_polarity'] = sc100_features['polarity']/sc100_features['query']
sc100_features['avg_subjectivity'] = sc100_features['subjectivity']/sc100_features['query']

In [52]:
sc100_features.head()

Unnamed: 0,min,query,total_tweet_len,avg_tweet_len,total_num_words,avg_num_words,total_stopwords,avg_stopwords,total_num_sentences,avg_num_sentences,total_avg_word_len,avg_avg_word_len,total_punctuation,avg_punctuation,total_hashtags,avg_hashtags,total_numerics,avg_numerics,total_upper,avg_upper,neg,neu,pos,compound,avg_neg,avg_neu,avg_pos,avg_compound,polarity,subjectivity,avg_polarity,avg_subjectivity
#AbuDhabi,42,832.0,161350.0,193.930288,20802.0,25.002404,4821.0,5.794471,1194.0,1.435096,5887.253316,7.076026,11662.0,14.016827,3829.0,4.602163,232.0,0.278846,908.0,1.091346,18.248,715.63,98.123,249.5025,0.021933,0.860132,0.117936,0.299883,158.079259,290.438874,0.189999,0.349085
#Abuja,107,828.0,173587.0,209.646135,22391.0,27.042271,4621.0,5.580918,1190.0,1.437198,5858.506206,7.075491,12726.0,15.369565,4792.0,5.78744,335.0,0.404589,1046.0,1.263285,15.064,709.099,103.828,295.2426,0.018193,0.8564,0.125396,0.356573,154.819778,302.964433,0.18698,0.365899
#Amsterdam,9,1000.0,180311.0,180.311,22627.0,22.627,5220.0,5.22,1520.0,1.52,7579.218712,7.579219,14554.0,14.554,5239.0,5.239,153.0,0.153,663.0,0.663,24.567,875.009,100.439,250.7977,0.024567,0.875009,0.100439,0.250798,152.441578,316.409962,0.152442,0.31641
#Ankara,57,313.0,58158.0,185.808307,7194.0,22.984026,1244.0,3.974441,409.0,1.306709,2350.81423,7.510589,4616.0,14.747604,1760.0,5.623003,106.0,0.338658,263.0,0.840256,12.42,271.031,29.554,60.1001,0.039681,0.865914,0.094422,0.192013,40.076816,75.45669,0.128041,0.241076
#Athens,99,1000.0,186752.0,186.752,24268.0,24.268,5405.0,5.405,1503.0,1.503,7170.59152,7.170592,15201.0,15.201,5442.0,5.442,226.0,0.226,865.0,0.865,42.164,865.646,92.19,176.3919,0.042164,0.865646,0.09219,0.176392,126.461888,320.792854,0.126462,0.320793


In [53]:
sc100_features = sc100_features.drop(columns=['total_tweet_len','total_num_words','total_stopwords','total_num_sentences','total_avg_word_len','total_punctuation','total_hashtags','total_numerics','total_upper','neg','neu','pos','compound','polarity','subjectivity'])
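
The sum-then-divide steps above amount to a per-city mean, as long as the feature columns contain no missing values. A minimal sketch of the same idea with a single `groupby` aggregation follows. The toy frame and its values are made up, and only the column names mirror the notebook.

```python
import pandas as pd

# Toy tweet-level frame; the values are made up for illustration.
tweets = pd.DataFrame({
    'query': ['#A', '#A', '#B'],
    'tweet_len': [100, 200, 50],
    'num_words': [10, 30, 8],
})

# One pass: the number of tweets and the per-city mean of each feature.
city_features = tweets.groupby('query').agg(
    num_tweets=('tweet_len', 'size'),
    avg_tweet_len=('tweet_len', 'mean'),
    avg_num_words=('num_words', 'mean'),
)
print(city_features)
```

This form avoids the intermediate `total_*` columns, so there is nothing to drop afterwards.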

In [54]:
# create a new column with lists of merged tweets grouped by city
sc100_features = pd.concat([sc100_features, smartcities100.groupby('query')['content'].apply(lambda x: x.tolist())], axis=1)

In [55]:
# convert the merged tweets lists into strings
sc100_features['content'] = sc100_features['content'].apply(lambda x: str(x).strip('[]'))
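
Converting a list with `str()` keeps the list's printed form, so quotes, commas and escaped newlines become part of the merged text. This is probably why tokens such as `garunoffsnsparta` appear later: the two characters `\n` survive as a literal "n" once punctuation is removed. The sketch below compares this with a plain join on made-up tweets. The join is an alternative, not what the notebook does.

```python
tweets = ["Good morning #Athens\nNice day", "It's sunny"]

# str() on a list returns its repr: quotes, commas and escaped newlines become part of the text.
as_repr = str(tweets).strip('[]')

# Joining keeps only the tweet text, separated by a single space.
as_joined = ' '.join(tweets)

print(as_repr)
print(as_joined)
```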

In [56]:
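# NOTE: len(x) counts the characters of the merged text, not its words;
# len(x.split()) would give a whitespace-delimited word count instead
sc100_features['words'] = sc100_features['content'].apply(lambda x: len(x))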

In [57]:
# count number of stopwords for each city
sc100_features['stopwords'] = sc100_features['content'].apply(lambda x: len([x for x in x.split() if x in stop]))

In [58]:
# count number of punctuation for each city
sc100_features['punctuation'] = sc100_features['content'].apply(lambda x: count_punct(x))

In [59]:
# count number of hashtags for each city
sc100_features['hashtags'] = sc100_features['content'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))

In [60]:
# count number of numerics for each city
sc100_features['numerics'] = sc100_features['content'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))

In [61]:
# count number of uppercase words for each city
sc100_features['upper'] = sc100_features['content'].apply(lambda x: len([x for x in x.split() if x.isupper()]))

# Clean the merged tweets

In [62]:
# create a new column to lower characters of tweets merged
sc100_features['clean_text'] = sc100_features['content'].apply(lambda x: x.lower())

In [63]:
# convert emojis into words
def convert_emojis(text):
    for emot in UNICODE_EMO:
        text = text.replace(emot, "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()))
    return text

sc100_features['clean_text'] = sc100_features['clean_text'].apply(lambda x: convert_emojis(x))

In [64]:
# convert emoticons into words
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text

sc100_features['clean_text'] = sc100_features['clean_text'].apply(lambda x: convert_emoticons(x))

In [65]:
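# remove punctuation (regex=True is required for the pattern to be treated as a regular expression in recent pandas versions)
sc100_features['clean_text'] = sc100_features['clean_text'].str.replace(r'[^\w\s]', '', regex=True)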

In [66]:
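# remove remaining symbols (the pattern must be a character class to match any single symbol;
# "_" is left out of the class because it joins the words of converted emojis and emoticons)
sc100_features['clean_text'] = sc100_features['clean_text'].apply(lambda x: re.sub(r'[.,;?!/\\&*$€£`§"+\-%|:’()]', '', str(x)))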

In [67]:
# remove stopwords
sc100_features['clean_text'] = sc100_features['clean_text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

In [68]:
# remove urls
def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

sc100_features['clean_text'] = sc100_features['clean_text'].apply(lambda x: remove_url(x))

In [69]:
# remove htmls
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

sc100_features['clean_text'] = sc100_features['clean_text'].apply(lambda x: remove_html(x))

In [70]:
# remove the "#" symbol but keep the hashtag text
sc100_features['clean_text'] = sc100_features['clean_text'].apply(lambda x: re.sub(r'#',' ', x))

In [71]:
# remove @mentions (Twitter handles may contain underscores)
sc100_features['clean_text'] = sc100_features['clean_text'].apply(lambda x: re.sub(r'@[A-Za-z0-9_]+','', str(x)))

In [72]:
#remove digits
sc100_features['clean_text'] = sc100_features['clean_text'].apply(lambda x: re.sub(r'[0-9]+','', str(x)))
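
The order of the cleaning steps matters: the patterns for URLs, HTML tags and @mentions rely on characters such as "://", "<" and "@", so they can only match while punctuation is still present. The sketch below shows one possible ordering on a made-up tweet. `clean_tweet` is a hypothetical helper, and emoji, emoticon, contraction and stopword handling are left out to keep it short.

```python
import re

def clean_tweet(text):
    """Hypothetical helper: one possible ordering of the cleaning steps."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # URLs first, while "://" still exists
    text = re.sub(r'<.*?>', ' ', text)                   # HTML tags
    text = re.sub(r'@\w+', ' ', text)                    # @mentions, underscores included
    text = text.replace('#', ' ')                        # keep the hashtag word, drop the symbol
    text = re.sub(r'[^\w\s]', ' ', text)                 # remaining punctuation
    text = re.sub(r'[0-9]+', ' ', text)                  # digits
    return ' '.join(text.split())                        # collapse repeated whitespace

print(clean_tweet("Visit #Athens with @aau_ae: https://t.co/6J3yOIP8Jm <b>now</b> 2021!"))
```

Replacing matches with a space rather than an empty string prevents neighbouring words from being glued together.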

In [73]:
# expand contractions 
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}
def cont_to_exp(x):
    # leave non-string values untouched
    if not isinstance(x, str):
        return x
    x = x.replace('\\', '')
    for key, value in contractions.items():
        x = x.replace(key, value)
    return x

sc100_features['clean_text'] = sc100_features['clean_text'].apply(cont_to_exp)
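
The replacement loop above is case-sensitive, so capitalised keys such as "I'd" cannot match text that has already been lowercased, and no key can match once apostrophes have been stripped. A case-insensitive variant is sketched below. It uses a single compiled pattern and only a small lower-cased subset of the mapping. `small_contractions` and `expand_contractions` are hypothetical names.

```python
import re

# A small subset of the mapping above, lower-cased, for illustration.
small_contractions = {"i'd": "i would", "can't": "cannot", "it's": "it is", "he'd": "he would"}

# Longest keys first so that longer contractions win over their prefixes.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(small_contractions, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace whole-word contractions, ignoring case."""
    return _pattern.sub(lambda m: small_contractions[m.group(0).lower()], text)

print(expand_contractions("I'd say it's fine, but she'd disagree"))
```

Because of the word boundaries, "she'd" is left alone here: it is not in the subset, and "he'd" does not match inside it.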

In [74]:
# count number of clean words
sc100_features['num_clean_words'] = sc100_features['clean_text'].apply(lambda x: len(str(x).split(" ")))

In [75]:
# count number of characters into clean text (incl. spaces)
sc100_features['num_clean_characters'] = sc100_features['clean_text'].str.len()

In [76]:
# average word length for clean words
def avg_word(sentence):
    words = sentence.split()
    return (sum(len(word) for word in words)/(len(words)+0.000001))

sc100_features['avg_clean_word_len'] = sc100_features['clean_text'].apply(lambda x: avg_word(x)).round(1)

## Sentiment Analysis from clean text

In [77]:
# sentiment polarity and subjectivity with textblob
sc100_features['polarity_clean'] = sc100_features['clean_text'].apply(lambda x: TextBlob(x).sentiment.polarity)
sc100_features['subjectivity_clean'] = sc100_features['clean_text'].apply(lambda x: TextBlob(x).sentiment.subjectivity)

## Tokenize clean text

In [78]:
# tokenize clean text into a dedicated column
sc100_features['tokens'] = sc100_features['clean_text'].apply(lambda x: nltk.word_tokenize(x))

In [79]:
# remove useless tokens
useless = ["com", "#", "&", "!", "gt", "'re", "\n", "amp", "als", "'s", "etc", "\\n", "\\n\\n", "]", "[", ".", "...", "“", ",", "'", "''", "th", "…", ":"]

sc100_features['tokens'] = sc100_features['tokens'].apply(lambda x: [item for item in x if item not in useless])

In [80]:
# Remove single letter words
sc100_features['tokens'] = sc100_features['tokens'].apply(lambda x: [item for item in x if len(item)>1] )

# part of speech tagging on tokens

In [81]:
sc100_features['pos_tag'] = sc100_features['tokens'].apply(lambda x: nltk.pos_tag(x))

## most frequent 100 tokens

In [82]:
# count most freq 100 words for each city
sc100_features['most_freq100'] = sc100_features['tokens'].apply(lambda x: FreqDist(x).most_common(100))
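
`FreqDist` is a subclass of `collections.Counter`, so `most_common(n)` returns a list of `(token, count)` pairs sorted by decreasing count. A minimal illustration with the standard library on made-up tokens:

```python
from collections import Counter

# Made-up tokens for illustration.
tokens = ['athens', 'greece', 'athens', 'greek', 'athens', 'greece']

top2 = Counter(tokens).most_common(2)
print(top2)  # [('athens', 3), ('greece', 2)]
```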

# Use algorithms to create some more variables

In [83]:
count_vect = CountVectorizer()
sc100_features['count_vec'] = sc100_features['tokens'].apply(lambda x: count_vect.fit_transform(x))

In [84]:
tfidf_transformer = TfidfTransformer()
sc100_features['tfidf'] = sc100_features['count_vec'].apply(lambda x: tfidf_transformer.fit_transform(x))
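
Calling `fit_transform` on a list of tokens treats every token as its own document, so each city gets a separate vocabulary and a matrix with one row per token. If the goal is to compare cities, an alternative is to fit once on one document per city. The sketch below does this on made-up strings standing in for the `clean_text` column.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy stand-in for the per-city clean_text column; the strings are made up.
city_docs = [
    "athens greece acropolis athens",
    "amsterdam netherlands canal",
    "ankara turkey ankara",
]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(city_docs)       # shape: (n_cities, vocabulary size)

tfidf = TfidfTransformer().fit_transform(counts)   # same shape, TF-IDF weighted rows

print(counts.shape, tfidf.shape)
print(sorted(count_vect.vocabulary_))
```

Each row then describes one city over a shared vocabulary, which is the input shape a classifier expects.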

#### These two features can later be used to train a classifier such as MultinomialNB
- clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
- clf2 = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)

#### Parameter tuning can be done using grid search
- parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3)}
- gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)
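
The `vect__`, `tfidf__` and `clf__` prefixes in the grid refer to named steps of a scikit-learn `Pipeline`. The sketch below wires these pieces together. The texts and labels are made up only to make it runnable, and `cv=3` is used instead of 5 because the toy data has three examples per class.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up texts and labels, only to make the sketch runnable.
texts = ["great sunny city", "lovely smart city", "bad traffic jam",
         "awful noisy traffic", "great lovely park", "bad awful smog"]
labels = [1, 1, 0, 0, 1, 0]

# The step names define the prefixes used in the parameter grid.
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

gs_clf = GridSearchCV(text_clf, parameters, cv=3)
gs_clf.fit(texts, labels)
print(gs_clf.best_params_)
```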

In [85]:
# Remove NaN row and reset index
sc100_features = sc100_features.reset_index().dropna()

In [86]:
# check my dataframe
sc100_features.head()

Unnamed: 0,index,min,query,avg_tweet_len,avg_num_words,avg_stopwords,avg_num_sentences,avg_avg_word_len,avg_punctuation,avg_hashtags,avg_numerics,avg_upper,avg_neg,avg_neu,avg_pos,avg_compound,avg_polarity,avg_subjectivity,content,words,stopwords,punctuation,hashtags,numerics,upper,clean_text,num_clean_words,num_clean_characters,avg_clean_word_len,polarity_clean,subjectivity_clean,tokens,pos_tag,most_freq100,count_vec,tfidf
0,#AbuDhabi,42,832.0,193.930288,25.002404,5.794471,1.435096,7.076026,14.016827,4.602163,0.278846,1.091346,0.021933,0.860132,0.117936,0.299883,0.189999,0.349085,'A Delegation from @aau_ae visited @BurjeelMe...,166602,4811,15703,3268,202,760,delegation aau_ae visited burjeelmedicity one ...,14484,173030,11.2,0.242483,0.471562,"[delegation, aau_ae, visited, burjeelmedicity,...","[(delegation, NN), (aau_ae, NN), (visited, VBD...","[(abudhabi, 677), (uae, 224), (abu, 156), (dub...","(0, 1235)\t1\n (1, 2)\t1\n (2, 5825)\t1\n ...","(0, 1235)\t1.0\n (1, 2)\t1.0\n (2, 5825)\t..."
1,#Abuja,107,828.0,209.646135,27.042271,5.580918,1.437198,7.075491,15.369565,5.78744,0.404589,1.263285,0.018193,0.8564,0.125396,0.356573,0.18698,0.365899,"'Brother #Abuja https://t.co/6J3yOIP8Jm', 'Our...",180295,4561,18553,3862,232,766,brother abuja httpsSkeptical_annoyed_undecided...,15083,182674,11.4,0.249437,0.498789,"[brother, abuja, httpsSkeptical_annoyed_undeci...","[(brother, NN), (abuja, NN), (httpsSkeptical_a...","[(abuja, 635), (lagos, 163), (abujatwittercomm...","(0, 732)\t1\n (1, 24)\t1\n (2, 2539)\t1\n ...","(0, 732)\t1.0\n (1, 24)\t1.0\n (2, 2539)\t..."
2,#Amsterdam,9,1000.0,180.311,22.627,5.22,1.52,7.579219,14.554,5.239,0.153,0.663,0.024567,0.875009,0.100439,0.250798,0.152442,0.31641,'Herfst in #Amsterdam 🍂🍃🍁 https://t.co/F0VKRmq...,185912,5200,18994,4740,134,563,herfst amsterdam fallen_leafleaf_fluttering_in...,15579,197584,11.9,0.2187,0.476012,"[herfst, amsterdam, fallen_leafleaf_fluttering...","[(herfst, NN), (amsterdam, NN), (fallen_leafle...","[(amsterdam, 993), (netherlands, 80), (day, 64...","(0, 1939)\t1\n (1, 171)\t1\n (2, 1517)\t1\...","(0, 1939)\t1.0\n (1, 171)\t1.0\n (2, 1517)..."
3,#Ankara,57,313.0,185.808307,22.984026,3.974441,1.306709,7.510589,14.747604,5.623003,0.338658,0.840256,0.039681,0.865914,0.094422,0.192013,0.128041,0.241076,'I had a great and very joyful flight today ✈️...,60230,1241,6328,1426,77,224,great joyful flight today airplane nto ankara ...,5302,66000,11.7,0.16446,0.419755,"[great, joyful, flight, today, airplane, nto, ...","[(great, JJ), (joyful, JJ), (flight, NN), (tod...","[(ankara, 251), (turkey, 77), (turkish, 26), (...","(0, 1059)\t1\n (1, 1592)\t1\n (2, 952)\t1\...","(0, 1059)\t1.0\n (1, 1592)\t1.0\n (2, 952)..."
4,#Athens,99,1000.0,186.752,24.268,5.405,1.503,7.170592,15.201,5.442,0.226,0.865,0.042164,0.865646,0.09219,0.176392,0.126462,0.320793,'Georgia GA #GARunoffs\n#Sparta #Athens #Ciara...,192839,5370,20229,4795,180,766,georgia ga garunoffsnsparta athens ciara lukes...,16782,204722,11.5,0.173722,0.448072,"[georgia, ga, garunoffsnsparta, athens, ciara,...","[(georgia, NN), (ga, NN), (garunoffsnsparta, N...","[(athens, 867), (greece, 322), (greek, 74), (a...","(0, 1653)\t1\n (1, 1613)\t1\n (2, 1628)\t1...","(0, 1653)\t1.0\n (1, 1613)\t1.0\n (2, 1628..."


In [87]:
# clean column names
sc100_features.columns = ['city_id', 'rank2020', 'num_tweets','avg_tweet_len', 'avg_num_words','avg_stopwords', 'avg_num_sentences', 'avg_word_len', 'avg_punctuation', 'avg_hashtags', 'avg_numerics', 'avg_upper', 'avg_neg', 'avg_neu', 'avg_pos', 'avg_compound', 'avg_polarity', 'avg_subjectivity', 'tweets_merged', 'words_total', 'stopwords_total', 'punctuation_total','hashtags_total','numerics_total','upper_total','clean_text','clean_words_total','clean_char_total','avg_clean_word_len','polarity_clean','subjectivity_clean','tokens','pos_tag','most_freq100','count_vec','tfidf']
sc100_features.head()

Unnamed: 0,city_id,rank2020,num_tweets,avg_tweet_len,avg_num_words,avg_stopwords,avg_num_sentences,avg_word_len,avg_punctuation,avg_hashtags,avg_numerics,avg_upper,avg_neg,avg_neu,avg_pos,avg_compound,avg_polarity,avg_subjectivity,tweets_merged,words_total,stopwords_total,punctuation_total,hashtags_total,numerics_total,upper_total,clean_text,clean_words_total,clean_char_total,avg_clean_word_len,polarity_clean,subjectivity_clean,tokens,pos_tag,most_freq100,count_vec,tfidf
0,#AbuDhabi,42,832.0,193.930288,25.002404,5.794471,1.435096,7.076026,14.016827,4.602163,0.278846,1.091346,0.021933,0.860132,0.117936,0.299883,0.189999,0.349085,'A Delegation from @aau_ae visited @BurjeelMe...,166602,4811,15703,3268,202,760,delegation aau_ae visited burjeelmedicity one ...,14484,173030,11.2,0.242483,0.471562,"[delegation, aau_ae, visited, burjeelmedicity,...","[(delegation, NN), (aau_ae, NN), (visited, VBD...","[(abudhabi, 677), (uae, 224), (abu, 156), (dub...","(0, 1235)\t1\n (1, 2)\t1\n (2, 5825)\t1\n ...","(0, 1235)\t1.0\n (1, 2)\t1.0\n (2, 5825)\t..."
1,#Abuja,107,828.0,209.646135,27.042271,5.580918,1.437198,7.075491,15.369565,5.78744,0.404589,1.263285,0.018193,0.8564,0.125396,0.356573,0.18698,0.365899,"'Brother #Abuja https://t.co/6J3yOIP8Jm', 'Our...",180295,4561,18553,3862,232,766,brother abuja httpsSkeptical_annoyed_undecided...,15083,182674,11.4,0.249437,0.498789,"[brother, abuja, httpsSkeptical_annoyed_undeci...","[(brother, NN), (abuja, NN), (httpsSkeptical_a...","[(abuja, 635), (lagos, 163), (abujatwittercomm...","(0, 732)\t1\n (1, 24)\t1\n (2, 2539)\t1\n ...","(0, 732)\t1.0\n (1, 24)\t1.0\n (2, 2539)\t..."
2,#Amsterdam,9,1000.0,180.311,22.627,5.22,1.52,7.579219,14.554,5.239,0.153,0.663,0.024567,0.875009,0.100439,0.250798,0.152442,0.31641,'Herfst in #Amsterdam 🍂🍃🍁 https://t.co/F0VKRmq...,185912,5200,18994,4740,134,563,herfst amsterdam fallen_leafleaf_fluttering_in...,15579,197584,11.9,0.2187,0.476012,"[herfst, amsterdam, fallen_leafleaf_fluttering...","[(herfst, NN), (amsterdam, NN), (fallen_leafle...","[(amsterdam, 993), (netherlands, 80), (day, 64...","(0, 1939)\t1\n (1, 171)\t1\n (2, 1517)\t1\...","(0, 1939)\t1.0\n (1, 171)\t1.0\n (2, 1517)..."
3,#Ankara,57,313.0,185.808307,22.984026,3.974441,1.306709,7.510589,14.747604,5.623003,0.338658,0.840256,0.039681,0.865914,0.094422,0.192013,0.128041,0.241076,'I had a great and very joyful flight today ✈️...,60230,1241,6328,1426,77,224,great joyful flight today airplane nto ankara ...,5302,66000,11.7,0.16446,0.419755,"[great, joyful, flight, today, airplane, nto, ...","[(great, JJ), (joyful, JJ), (flight, NN), (tod...","[(ankara, 251), (turkey, 77), (turkish, 26), (...","(0, 1059)\t1\n (1, 1592)\t1\n (2, 952)\t1\...","(0, 1059)\t1.0\n (1, 1592)\t1.0\n (2, 952)..."
4,#Athens,99,1000.0,186.752,24.268,5.405,1.503,7.170592,15.201,5.442,0.226,0.865,0.042164,0.865646,0.09219,0.176392,0.126462,0.320793,'Georgia GA #GARunoffs\n#Sparta #Athens #Ciara...,192839,5370,20229,4795,180,766,georgia ga garunoffsnsparta athens ciara lukes...,16782,204722,11.5,0.173722,0.448072,"[georgia, ga, garunoffsnsparta, athens, ciara,...","[(georgia, NN), (ga, NN), (garunoffsnsparta, N...","[(athens, 867), (greece, 322), (greek, 74), (a...","(0, 1653)\t1\n (1, 1613)\t1\n (2, 1628)\t1...","(0, 1653)\t1.0\n (1, 1613)\t1.0\n (2, 1628..."


In [88]:
# clean city_id by removing # using regex
sc100_features['city_id'] = sc100_features['city_id'].apply(lambda x: re.sub(r'#','', str(x)))

In [89]:
# convert city_id to lowercase
sc100_features['city_id'] = sc100_features['city_id'].str.lower()
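
The two cells above turn hashtags such as `#AbuDhabi` into plain lowercase identifiers. A minimal sketch of the same normalization on a plain Python list, using the five city hashtags shown in the output; `normalize_city_id` is a hypothetical helper, not part of the notebook. In pandas the same result could be obtained in one chained expression with `.str.replace('#', '', regex=False).str.lower()`.

```python
def normalize_city_id(raw):
    """Remove every '#' and lowercase the rest, mirroring the two cells above."""
    return str(raw).replace("#", "").lower()

# Hashtags taken from the city_id column shown in the output above
hashtags = ["#AbuDhabi", "#Abuja", "#Amsterdam", "#Ankara", "#Athens"]
city_ids = [normalize_city_id(h) for h in hashtags]
print(city_ids)
```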

In [90]:
# cast num_tweets from float to integer
sc100_features['num_tweets'] = sc100_features['num_tweets'].astype(np.int64)
sc100_features.head()

Unnamed: 0,city_id,rank2020,num_tweets,avg_tweet_len,avg_num_words,avg_stopwords,avg_num_sentences,avg_word_len,avg_punctuation,avg_hashtags,avg_numerics,avg_upper,avg_neg,avg_neu,avg_pos,avg_compound,avg_polarity,avg_subjectivity,tweets_merged,words_total,stopwords_total,punctuation_total,hashtags_total,numerics_total,upper_total,clean_text,clean_words_total,clean_char_total,avg_clean_word_len,polarity_clean,subjectivity_clean,tokens,pos_tag,most_freq100,count_vec,tfidf
0,abudhabi,42,832,193.930288,25.002404,5.794471,1.435096,7.076026,14.016827,4.602163,0.278846,1.091346,0.021933,0.860132,0.117936,0.299883,0.189999,0.349085,'A Delegation from @aau_ae visited @BurjeelMe...,166602,4811,15703,3268,202,760,delegation aau_ae visited burjeelmedicity one ...,14484,173030,11.2,0.242483,0.471562,"[delegation, aau_ae, visited, burjeelmedicity,...","[(delegation, NN), (aau_ae, NN), (visited, VBD...","[(abudhabi, 677), (uae, 224), (abu, 156), (dub...","(0, 1235)\t1\n (1, 2)\t1\n (2, 5825)\t1\n ...","(0, 1235)\t1.0\n (1, 2)\t1.0\n (2, 5825)\t..."
1,abuja,107,828,209.646135,27.042271,5.580918,1.437198,7.075491,15.369565,5.78744,0.404589,1.263285,0.018193,0.8564,0.125396,0.356573,0.18698,0.365899,"'Brother #Abuja https://t.co/6J3yOIP8Jm', 'Our...",180295,4561,18553,3862,232,766,brother abuja httpsSkeptical_annoyed_undecided...,15083,182674,11.4,0.249437,0.498789,"[brother, abuja, httpsSkeptical_annoyed_undeci...","[(brother, NN), (abuja, NN), (httpsSkeptical_a...","[(abuja, 635), (lagos, 163), (abujatwittercomm...","(0, 732)\t1\n (1, 24)\t1\n (2, 2539)\t1\n ...","(0, 732)\t1.0\n (1, 24)\t1.0\n (2, 2539)\t..."
2,amsterdam,9,1000,180.311,22.627,5.22,1.52,7.579219,14.554,5.239,0.153,0.663,0.024567,0.875009,0.100439,0.250798,0.152442,0.31641,'Herfst in #Amsterdam 🍂🍃🍁 https://t.co/F0VKRmq...,185912,5200,18994,4740,134,563,herfst amsterdam fallen_leafleaf_fluttering_in...,15579,197584,11.9,0.2187,0.476012,"[herfst, amsterdam, fallen_leafleaf_fluttering...","[(herfst, NN), (amsterdam, NN), (fallen_leafle...","[(amsterdam, 993), (netherlands, 80), (day, 64...","(0, 1939)\t1\n (1, 171)\t1\n (2, 1517)\t1\...","(0, 1939)\t1.0\n (1, 171)\t1.0\n (2, 1517)..."
3,ankara,57,313,185.808307,22.984026,3.974441,1.306709,7.510589,14.747604,5.623003,0.338658,0.840256,0.039681,0.865914,0.094422,0.192013,0.128041,0.241076,'I had a great and very joyful flight today ✈️...,60230,1241,6328,1426,77,224,great joyful flight today airplane nto ankara ...,5302,66000,11.7,0.16446,0.419755,"[great, joyful, flight, today, airplane, nto, ...","[(great, JJ), (joyful, JJ), (flight, NN), (tod...","[(ankara, 251), (turkey, 77), (turkish, 26), (...","(0, 1059)\t1\n (1, 1592)\t1\n (2, 952)\t1\...","(0, 1059)\t1.0\n (1, 1592)\t1.0\n (2, 952)..."
4,athens,99,1000,186.752,24.268,5.405,1.503,7.170592,15.201,5.442,0.226,0.865,0.042164,0.865646,0.09219,0.176392,0.126462,0.320793,'Georgia GA #GARunoffs\n#Sparta #Athens #Ciara...,192839,5370,20229,4795,180,766,georgia ga garunoffsnsparta athens ciara lukes...,16782,204722,11.5,0.173722,0.448072,"[georgia, ga, garunoffsnsparta, athens, ciara,...","[(georgia, NN), (ga, NN), (garunoffsnsparta, N...","[(athens, 867), (greece, 322), (greek, 74), (a...","(0, 1653)\t1\n (1, 1613)\t1\n (2, 1628)\t1...","(0, 1653)\t1.0\n (1, 1613)\t1.0\n (2, 1628..."


# extract weight of BoW for each city_id

In [91]:
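# define data transformations
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    # reduce each token to its Porter stem
    return [stemmer.stem(item) for item in tokens]

def tokenize(text):
    # split the text into word tokens with NLTK, then stem them
    tokens = word_tokenize(text)
    return stem_tokens(tokens, stemmer)

# the custom tokenizer replaces CountVectorizer's default token pattern;
# the English stop-word list is applied to the stemmed tokens
vect = CountVectorizer(tokenizer=tokenize, stop_words='english')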

# define each BoW as a list of thematic words (Lexicons)
smartcity = ['digitaltransformation','2.0','smartcity','iot','smart','city','ai','smartcities','innovation','cities','5g','data','bigdata','technology','future','tech','digital','digitaltransformation','machinelearning','analytics','solutions','cloud','datascience','network','security','cybersecurity','urban','people','artificialintelligence','smarthome','internetofthings','autonomous','smart','technologies','robotics','software','research','monitoring','street','sensors','futureofwork','planning','governance','transportation','safecity','building','robot','industry40','startups','development','smartthings','urbanplanning','connected','internet','sustainable','waste','emergingtech','programming','environment','safety','lighting','safety','roads','workplaces','smartbuildings','driverless','selfdriving','smartmobility','futureworkspace','digitalcity','deeplearning','datadriven','application','robots','publicsafety','energyefficiency','labgov','citizens','transformation','controlroom','publicspace','airquality','efficiency','era','smarter','urbaninnovation','internetofeverything','newtown','iotforall','construction','privacy','surveillance','drones','smartbuilding','smartgrid','smartgrids','cybersecurity','policy','powergrid','smartcities','iot','electricity','smartmetering','smartmeters','technology','microgrids','microgrid','smartcity','powergrids','iot','ai','machinelearning','bigdata','serverless','cybersecurity','devcommunity','programming','coding','cloudcomputing','codenewbie','100daysofmlcode','womenintech','innovation','deeplearning','security','machine','digital','industrial','artificial','computing','robot','wearables','artificial','networks','drones','connectivity','wireless','urbanplanning','city','urban','cities','smartcities','architecture','urbanism','urbandesign','cityplanning','urbanization','futurecities','urbaninnovation','bigdata','cityplanner','architects','citylab','urbanplanner','urbanplanners','cityplanners','urbandata','townplanning','planners',
'neighborhoods','sustainablecities','tacticalurbanism','urbanmobility','pedestrians','neighborhood','citiesforpeople','neighbourhood','rethinkreplace','constructionmanagement','suburban','futurecity','cityvisionnaire','smarttechnology','citygoals','civilengineering','digitaltwin','bigdata','digitaltransformation','pedestrian-friendly','surveillance','technologies','urbandevelopment','cities','urban','urbanplanning']
civictech = ['aiethics','empowerment','learned','communityengagement','communitydevelopment','nonprofit','notforprofit','collaboration','vulnerable','diversity','egov','egovernance','inclusion','ethics','social','engagement','crowdsourcing','humanrights','citizenship','civic','opendata','referendum','rights','citizens','open','social','people','sharingeconomy','hack','citizens','communities','civictech','inequalities','equity','participatory','social','youthempowerment','govtech','localgov','civic','stateandlocal','community','governmentit','statelocalit','opendata','sharing','including','volunteer','social','people','shared','learn','opengov','localgovernment','education','hackathon','civictechto','techforgood','discussion','communities','local','digitalgov','future','online','governments','webmapping','capacity','accessibility','discuss','ethics','citizens','together','initiative','transparency','data4good','publicinteresttech','democracy','volunteers','society','makers','ecitizenship','eresidency','citizenship','egovernment','esafety','digitaltransformation','policy','social','technology','future','citizen','ecivility','digitaleconomy','ideagov','egov','people','edemocracy','eparticipation','education','citizens','progressive','security','ecitizenship','inclusive','watch','startups','concept','digitalnation','netgov','app','privacy','egovernance','tool','democratic','eublockchain','digitaleu','digitalnomad','apps','residents','populationeu','migrationpolicy','migrationsystem','migration','voted','govtech','smartcities','humandevelopment','erisks','ecitizen','artificial','intelligence','opendata','empowerment','leadership','growth','success','youth','wisdom','education','relationship','powerinpurpose','community','beyondlimits','powerwithin','support','empower','empowering','equality','enlightment','selfdiscovery','confidence','selfempowerment','leaders','forward','diversity','empowered','successmindset','cybersecurity','government','security','technology',
'opendata','open','openscience','opensource','opendataday','community','citizenscience','citizen','citizenship','openaccess','transparence','openbanking','womenwhocode','openapis','policy','grants','decisions','institutions','opengov','social','busopendata','datadriven','citizen','openstreetmap','researchdata','engagement','citizens']
infrastructure = ['supply','platform','blockchain','system','internet','battery','batterystorage','energystorage','mobility','storage','blockchain','serverless','grids','blockchain','standards','realestate','construction','driverless','selfdriving','smartmobility','transportation','roads', 'building','traffic','autonomousvehicles','streets','platform','transportation','road','publicspace','realestate','monuments','housing','selfdrivingcars','high-speed','real-estate','mobility','highways','transportation','transport','automotive','transit','buses','cars','futureofmobility','car','travel','publictransport','autonomous','bus','auto','shared','traffic','delivery','emissions','5g','services','connectivity','van','trucks','vehicles','commute','electricvehicles','smartmobility','vehicle','logistics','multimodal','connected','metropolitan','efficiency','monitoring','providers','mobile','futurride','autonomousvehicle','engineer','seat','robotics','parking','cycling','driver','flying','rail','moving','wheelchair','automobile','walks','electricvehicle','electriccars','platform','cctv','drivers','driverless','systems','roads','drones','aviation','selfdriving','robots','electriccar','device','bike','autos','urbanmobility','speed','selfdrivingcars','network','street','urbanplanning','autocar','drive','electrification','streets','driverlesscars','bicycle','zero-emission','engineering','artificialintelligence','emobility','fleet','scooters','bikes','journey','manufacturers','route','logistics','driver','trucking','delivery','shipping','traffic','transport','freight','rentals','road','drivers','bus','mobility','rail','cargo','cars','trucks','trucker','fleet','mechanic','infrastructure','railways','diesel','taxi','truckdriver','truckers','shipments','driving','flights','pickup','airfreight','intelligentlogistics','bicycle','cycling','port','wheels','motors','machinery','ferry','infrastructure','bridge','network','architecture','building','construction','transport','metro','roads',
'highways','railways','realestate','housing','train','civilengineering','airport','station','connectivity','stations','tunnel','airports','equipment','bridges','km','electrical','highway','tunnels','road']
sustainability = ['resource','resources','sustainablefinance','cyberresilience','recycling','resilient','batteries','emission','resilience','energytransition','sustainability','biodiversity','sustainabledevelopment','resources','sustainableurbanplanning','sustainabletransport','sustainable','environment','climatechange','energy','climate','innovation','circulareconomy','environmental','ecofriendly','nature','planet','carbon','recycling','climateaction','plastic','renewableenergy','zerowaste','water','waste','solar','agriculture','recycle','savetheplanet','pollution','emissions','sustainablefashion','hydrogen','eco','sustainableliving','reuse','clean','recycled','greenliving','renewable','circular','gogreen','biodiversity','sustainable','renewables','climatecrisis','organic','eco-friendly','earth','sustainabledevelopment','electric','electricity','efficiency','electricvehicles','recyclable','footprint','sustainablefinance','ocean','greener','co2','plasticfree','climateemergency','greenenergy','waterconservation','carbonfootprint','greenbuilding','energytransition','sustainablity','consumption','globalwarming','resilience','reusable','battery','decompose','aquaculture','oil','compost','circularity','plastics','environment','ecoconscious','savetheplanet','green_action','eco-outreach','eco-friendly','environmental_activism','climateactivism','energyefficiency','agtech','reduction','lessismore','responsibility','🌍','cleanup','cleantech','ethical','sustainably','ecofashion','foodtech','emobility','ecosystem','passivehouse','farmtotable','nature','wildlife','planet','ecofriendly','trees','conservation','climatecrisis','savetheplanet','clean','recycle','natural','biodiversity','tree','plants','emissions','reduce','ecology','zerowaste','forests','globalwarming','garden','species','gogreen','farming','agriculture','eco','plasticpollution','soil','greener','plasticfree','farmers','extinction','reuse','recycled','renewableenergy','circulareconomy','environmentally',
'deforestation','oceans','fuel','naturelovers','microplastics','airpollution','rivers','plastics','renewables','climateemergency','gas','sustainablefashion','environmentaljustice','eco-friendly','bees','fossil','wastewater','fornature','wildfires','greenhouse','ecosystems','renewables','oilandgas','hydrogen','cleanenergy','greenspaces','resilience','climateemergency','sustainablemobility','sustainabledevelopment','naturebasedsolutions','energyefficiency','actonclimate','solar','renewableenergy','alternativeenergy','solarpower','sustainable','environment','solarpanels','solarpv','scienceandenvironment','solarsystems','solarenergy','energymanagement','greennewdeal','solarenergy']
governance = ['partnership','strategies','leadership','responsibility','regulations','ceo','policy','democracy2.0','elections','voter','collaboration','gov','governance','planning','management','localgov','project','partnership','partners','partner','leader','leaders','politicalparties','administration','decentralized','democracy2.0','political','society','minister','vision','crossbordercooperation','political','politics','democratic','government','politicians','elections','voter','public','commonwealth','politicalprogress','voting','debate','national','votingawareness','president','mayor','governance','parliament','campaign','liquiddemocracy','voters','brigade','constitutional','law','country','legalhack','decentralization','representatives','communityactivity','republican','policing','policy','policies','administration','governance','leadership','management','security','cybersecurity','ceo','corpgov','government','public','strategy','corruption','democracy','boardofdirectors','cio','policy','technology','leaders','boards','role','members','report','planning','throwback','organizations','national','directorship','structures','trust','directors','decentralized','countries','chair','discussion','conference','transparency','compensationcommittee','auditcommittee','cto','visionarytalks','riskmanagement','organisations','office','legal','regulations','audit','insights','corporategovernance','organization','voting','governing','organisation','policies','project','executive','responsible','leader','survey','interview','approach','staff','benefits','recruitment','regulatory','reporting','severity','obstacle','presidential','country','elections','institutions','alternative','digitaltransformation','freedom','nation','datavault','datawarehouse','regulation','processes','foundation','critical','control','candidate','promised','inauguration','anticorruption','executives','publicpolicy','principles','strategic','sharing','governments','federated','elected','framework','schedule',
'risks','certificate','meritocracy','stakeholders','partnership','society','equality','decisions','structure','implement','commitment','decision','driven','citiesalliance','assessment','corporations','cryptocurrencies','dataprotection']
entrepreneurship = ['freemarket','investors','crypto','financial','accountability','enterprise','assetmanagement','accountable','company','finance','corporate','contractors','workforce','digitalbanking','entrepreneur','newmarket','investment', 'market','coding','startup','development','maker','fintech','economy','business','economic','finance','markets','investing','stocks','investment','gdp','money','financial','economics','trade','crypto','businesses','investors','retail','bitcoin','employment','job','startups','growth','work','entrepreneur','unemployment','entrepreneurship','sales','manufacturing','working','corporate','companies','funds','jobless','cryptocurrency','economicrecovery','smallbusiness','workers','investments','traders','budget','assets','growing','industries','asset','invest','startup','funding','development','currency','banks','monetary','debt','skills','fintech','fiscal','entrepreneurs','deficit','banking','benefits','rates','paying','billing','billingsoftware','earnings','workfromhome','accounting','personalfinance','ratings','accountant','treasury','equities','subsidies','opportunities','profits','stockstowatch','branding','marketing','customer','firms','localbusiness','capitalism','consumers','startup','startups','innovation','mindset','smallbusiness','businessowner','entrepreneurlife','digitalmarketing','management','founder','branding','growth','company','founders','funding','product','businesses','businessgrowth','sales','coaching','investment','entrepreneurial','venture','businessangel','capital','smallbusinessowner','businessman','customers','businesscoach','speaker','finance','market','skills','coding','ecommerce','entrepreneurmindset','startuplife','pitch','onlinebusiness','employment','businesswoman','socialmediamarketing','girlswhocode','crypto','womeninbusiness','investors','selfemployment','accelerator','incubator','designthinking','growthhacking','enterprise','changemakers','startupbusiness','businessideas','mentors',
'businesscoaching','businessmotivation','businessgoals','ventures','skill','innovators','prototype','growthmindset','pitching','innovations','businesslife','businessstrategy','smallbusinessowners','financial','companies','entrepreneurlifestyle','franchise','manufactures','beyourownboss']
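
The lexicons above repeat some words and share others across themes. Repeats inside one lexicon have no effect, because `CountVectorizer` builds its vocabulary from unique terms, but a word that sits in several lexicons adds to every matching theme score. The sketch below shows one way to inspect this with plain sets. It uses short excerpts of three lexicons rather than the full lists, and the `lexicons` and `overlaps` names are illustrative only.

```python
from itertools import combinations

# Short excerpts of the lexicons defined above (not the full lists)
lexicons = {
    "smartcity": ["smartcity", "iot", "smart", "digitaltransformation", "iot", "policy"],
    "civictech": ["civictech", "opendata", "policy", "digitaltransformation", "opendata"],
    "governance": ["governance", "policy", "digitaltransformation", "elections"],
}

# Repeated entries inside a single lexicon
for name, words in lexicons.items():
    print(name, len(words), "entries,", len(set(words)), "unique")

# Words shared by each pair of themes
overlaps = {
    (a, b): sorted(set(lexicons[a]) & set(lexicons[b]))
    for a, b in combinations(lexicons, 2)
}
for pair, shared in overlaps.items():
    print(pair, shared)
```

Running the same check on the full lists would show how much the six theme scores depend on shared vocabulary.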

In [92]:
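# fit the vectorizer on the smartcity lexicon, then count lexicon terms in each city's clean text
# (fit() returns vect itself, so the transform must run before vect is refitted on another lexicon)
sma_vec = vect.fit(smartcity)
sc100_features['sma_bow'] = sc100_features['clean_text'].apply(lambda x: np.sum(sma_vec.transform([x]).toarray()))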
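
The `sma_bow` value is the total number of occurrences of lexicon terms in a city's `clean_text`. A minimal pure-Python sketch of that count is given below as an illustration. It is a simplified stand-in, not the notebook's method: it splits on whitespace and skips the Porter stemming and stop-word filtering that the `CountVectorizer` above applies, and `lexicon_hits` is a hypothetical helper.

```python
from collections import Counter

def lexicon_hits(text, lexicon):
    """Total occurrences in `text` of any term from `lexicon` (no stemming)."""
    vocabulary = set(lexicon)            # duplicates in the lexicon collapse here
    counts = Counter(text.split())       # whitespace tokenization only
    return sum(n for token, n in counts.items() if token in vocabulary)

lexicon = ["smartcity", "iot", "smart", "city", "iot"]
text = "smart city pilots use iot sensors and more iot data"
score = lexicon_hits(text, lexicon)
print(score)  # smart (1) + city (1) + iot (2) = 4
```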

In [93]:
# count civictech lexicon terms in each city's clean text
civ_vec = vect.fit(civictech)
sc100_features['civ_bow'] = sc100_features['clean_text'].apply(lambda x: np.sum(civ_vec.transform([x]).toarray()))

In [94]:
# count infrastructure lexicon terms in each city's clean text
inf_vec = vect.fit(infrastructure)
sc100_features['inf_bow'] = sc100_features['clean_text'].apply(lambda x: np.sum(inf_vec.transform([x]).toarray()))

In [95]:
# count sustainability lexicon terms in each city's clean text
sus_vec = vect.fit(sustainability)
sc100_features['sus_bow'] = sc100_features['clean_text'].apply(lambda x: np.sum(sus_vec.transform([x]).toarray()))

In [96]:
# count governance lexicon terms in each city's clean text
gov_vec = vect.fit(governance)
sc100_features['gov_bow'] = sc100_features['clean_text'].apply(lambda x: np.sum(gov_vec.transform([x]).toarray()))

In [97]:
# count entrepreneurship lexicon terms in each city's clean text
ent_vec = vect.fit(entrepreneurship)
sc100_features['ent_bow'] = sc100_features['clean_text'].apply(lambda x: np.sum(ent_vec.transform([x]).toarray()))
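
The six cells above all refit the same `vect` object. In scikit-learn, `fit()` returns the estimator itself, so `sma_vec`, `civ_vec` and the others are six names for one vectorizer. The scores are correct when the cells run top to bottom, since each transform follows its own fit, but re-running an earlier transform later would use the last fitted vocabulary. The sketch below illustrates the effect with `TinyVectorizer`, a hypothetical stand-in class written for this example (it is not a scikit-learn class), and shows the safer pattern of one instance per lexicon.

```python
class TinyVectorizer:
    """Hypothetical stand-in that mimics the convention of fit() returning self."""

    def fit(self, lexicon):
        self.vocabulary_ = set(lexicon)
        return self

    def count(self, text):
        # number of whitespace tokens found in the fitted vocabulary
        return sum(token in self.vocabulary_ for token in text.split())

text = "smart iot civic"

# Shared instance: both names point at the same object
vect = TinyVectorizer()
sma_vec = vect.fit(["smart", "iot"])
civ_vec = vect.fit(["civic", "opendata"])
same_object = sma_vec is civ_vec
stale_score = sma_vec.count(text)   # scored against the civic vocabulary

# One instance per lexicon keeps the vocabularies apart
sma_own = TinyVectorizer().fit(["smart", "iot"])
civ_own = TinyVectorizer().fit(["civic", "opendata"])
fresh_scores = (sma_own.count(text), civ_own.count(text))

print(same_object, stale_score, fresh_scores)
```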

In [98]:
# avg smartcity weight
sc100_features['avg_sma_bow'] = sc100_features['sma_bow']/sc100_features['num_tweets']

In [99]:
# avg civictech weight
sc100_features['avg_civ_bow'] = sc100_features['civ_bow']/sc100_features['num_tweets']

In [100]:
# avg infrastructure weight
sc100_features['avg_inf_bow'] = sc100_features['inf_bow']/sc100_features['num_tweets']

In [101]:
# avg sustainability weight
sc100_features['avg_sus_bow'] = sc100_features['sus_bow']/sc100_features['num_tweets']

In [102]:
# avg governance weight
sc100_features['avg_gov_bow'] = sc100_features['gov_bow']/sc100_features['num_tweets']

In [103]:
# avg entrepreneurship weight
sc100_features['avg_ent_bow'] = sc100_features['ent_bow']/sc100_features['num_tweets']
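
Each average weight is the raw lexicon count divided by the city's number of tweets, which makes cities with different tweet volumes comparable. As a quick arithmetic check, the sketch below recomputes `avg_sma_bow` from the `sma_bow` and `num_tweets` values shown in the output; the `rows` and `avg_sma` names are illustrative only.

```python
# (city_id, sma_bow, num_tweets) as shown in the output table
rows = [
    ("abudhabi", 309, 832),
    ("abuja", 247, 828),
    ("amsterdam", 391, 1000),
    ("ankara", 49, 313),
    ("athens", 248, 1000),
]

# average smartcity weight per tweet, rounded to six decimals as in the output
avg_sma = {city: round(hits / n, 6) for city, hits, n in rows}
for city, value in avg_sma.items():
    print(city, value)
```

Ankara has far fewer tweets (313) than the other four cities, so its raw count of 49 is not directly comparable with theirs, while the per-tweet average is.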

In [104]:
sc100_features.head()

Unnamed: 0,city_id,rank2020,num_tweets,avg_tweet_len,avg_num_words,avg_stopwords,avg_num_sentences,avg_word_len,avg_punctuation,avg_hashtags,avg_numerics,avg_upper,avg_neg,avg_neu,avg_pos,avg_compound,avg_polarity,avg_subjectivity,tweets_merged,words_total,stopwords_total,punctuation_total,hashtags_total,numerics_total,upper_total,clean_text,clean_words_total,clean_char_total,avg_clean_word_len,polarity_clean,subjectivity_clean,tokens,pos_tag,most_freq100,count_vec,tfidf,sma_bow,civ_bow,inf_bow,sus_bow,gov_bow,ent_bow,avg_sma_bow,avg_civ_bow,avg_inf_bow,avg_sus_bow,avg_gov_bow,avg_ent_bow
0,abudhabi,42,832,193.930288,25.002404,5.794471,1.435096,7.076026,14.016827,4.602163,0.278846,1.091346,0.021933,0.860132,0.117936,0.299883,0.189999,0.349085,'A Delegation from @aau_ae visited @BurjeelMe...,166602,4811,15703,3268,202,760,delegation aau_ae visited burjeelmedicity one ...,14484,173030,11.2,0.242483,0.471562,"[delegation, aau_ae, visited, burjeelmedicity,...","[(delegation, NN), (aau_ae, NN), (visited, VBD...","[(abudhabi, 677), (uae, 224), (abu, 156), (dub...","(0, 1235)\t1\n (1, 2)\t1\n (2, 5825)\t1\n ...","(0, 1235)\t1.0\n (1, 2)\t1.0\n (2, 5825)\t...",309,296,335,257,310,335,0.371394,0.355769,0.402644,0.308894,0.372596,0.402644
1,abuja,107,828,209.646135,27.042271,5.580918,1.437198,7.075491,15.369565,5.78744,0.404589,1.263285,0.018193,0.8564,0.125396,0.356573,0.18698,0.365899,"'Brother #Abuja https://t.co/6J3yOIP8Jm', 'Our...",180295,4561,18553,3862,232,766,brother abuja httpsSkeptical_annoyed_undecided...,15083,182674,11.4,0.249437,0.498789,"[brother, abuja, httpsSkeptical_annoyed_undeci...","[(brother, NN), (abuja, NN), (httpsSkeptical_a...","[(abuja, 635), (lagos, 163), (abujatwittercomm...","(0, 732)\t1\n (1, 24)\t1\n (2, 2539)\t1\n ...","(0, 732)\t1.0\n (1, 24)\t1.0\n (2, 2539)\t...",247,201,296,67,290,398,0.298309,0.242754,0.357488,0.080918,0.350242,0.480676
2,amsterdam,9,1000,180.311,22.627,5.22,1.52,7.579219,14.554,5.239,0.153,0.663,0.024567,0.875009,0.100439,0.250798,0.152442,0.31641,'Herfst in #Amsterdam 🍂🍃🍁 https://t.co/F0VKRmq...,185912,5200,18994,4740,134,563,herfst amsterdam fallen_leafleaf_fluttering_in...,15579,197584,11.9,0.2187,0.476012,"[herfst, amsterdam, fallen_leafleaf_fluttering...","[(herfst, NN), (amsterdam, NN), (fallen_leafle...","[(amsterdam, 993), (netherlands, 80), (day, 64...","(0, 1939)\t1\n (1, 171)\t1\n (2, 1517)\t1\...","(0, 1939)\t1.0\n (1, 171)\t1.0\n (2, 1517)...",391,253,348,105,212,295,0.391,0.253,0.348,0.105,0.212,0.295
3,ankara,57,313,185.808307,22.984026,3.974441,1.306709,7.510589,14.747604,5.623003,0.338658,0.840256,0.039681,0.865914,0.094422,0.192013,0.128041,0.241076,'I had a great and very joyful flight today ✈️...,60230,1241,6328,1426,77,224,great joyful flight today airplane nto ankara ...,5302,66000,11.7,0.16446,0.419755,"[great, joyful, flight, today, airplane, nto, ...","[(great, JJ), (joyful, JJ), (flight, NN), (tod...","[(ankara, 251), (turkey, 77), (turkish, 26), (...","(0, 1059)\t1\n (1, 1592)\t1\n (2, 952)\t1\...","(0, 1059)\t1.0\n (1, 1592)\t1.0\n (2, 952)...",49,73,67,20,106,104,0.15655,0.233227,0.214058,0.063898,0.338658,0.332268
4,athens,99,1000,186.752,24.268,5.405,1.503,7.170592,15.201,5.442,0.226,0.865,0.042164,0.865646,0.09219,0.176392,0.126462,0.320793,'Georgia GA #GARunoffs\n#Sparta #Athens #Ciara...,192839,5370,20229,4795,180,766,georgia ga garunoffsnsparta athens ciara lukes...,16782,204722,11.5,0.173722,0.448072,"[georgia, ga, garunoffsnsparta, athens, ciara,...","[(georgia, NN), (ga, NN), (garunoffsnsparta, N...","[(athens, 867), (greece, 322), (greek, 74), (a...","(0, 1653)\t1\n (1, 1613)\t1\n (2, 1628)\t1...","(0, 1653)\t1.0\n (1, 1613)\t1.0\n (2, 1628...",248,324,265,103,265,317,0.248,0.324,0.265,0.103,0.265,0.317


In [105]:
# save a new .csv file that includes the newly created columns
sc100_features.to_csv('sc100_features.csv', index=False)
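
One thing to keep in mind when reloading this file: CSV stores text only, so list-valued columns such as `tokens`, `pos_tag` and `most_freq100` come back as strings rather than lists. The sketch below illustrates the round trip with the standard `csv` module and recovers the lists with `ast.literal_eval`. The sample row reuses a few values from the Ankara output; the in-memory buffer stands in for the file. With pandas, passing `converters={'tokens': ast.literal_eval}` to `pd.read_csv` is one way to do the same thing. The `count_vec` and `tfidf` columns hold printed sparse matrices, which this approach is not expected to recover.

```python
import ast
import csv
import io

# Sample row with list-valued cells (values taken from the Ankara output)
row = {
    "city_id": "ankara",
    "tokens": ["great", "joyful", "flight"],
    "most_freq100": [("ankara", 251), ("turkey", 77)],
}

# Write to an in-memory CSV; non-string values are stored via str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every cell is now a string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["tokens"]))

# Parse the string representations back into Python objects
tokens = ast.literal_eval(loaded["tokens"])
most_freq = ast.literal_eval(loaded["most_freq100"])
print(tokens, most_freq)
```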