## 2. Exploratory Data Analysis 

In this section, our aim is to get some insights about the dataset which we cleaned in the previous section. 

We will find the most common words and visualise them using the *matplotlib* and *wordcloud* modules to get an idea of the topics around which the conversation in the show revolves. Later on, we can do the same for TBBT and compare the differences in language, if any.

[Matplotlib Documentation](https://matplotlib.org/)<br>
[Word Cloud Documentation](https://amueller.github.io/word_cloud/)

In [None]:
data_t = data_dtm.T
data_t.head()

In [None]:
# Top 10 words in each episode


top_dict = {}
for c in data_t.columns:
    top = data_t[c].sort_values(ascending=False).head(10)
    top_dict[c]= list(zip(top.index, top.values))

top_dict

We can see that many of the top words are ones generally used in day-to-day conversation. These words will not help us find meaningful insights, so I have removed them using the text module of scikit-learn, which already provides a list of such words called `ENGLISH_STOP_WORDS`. To this list, we will add the most common words found above.

Further, we will also remove profanity, since it is abundant in our transcripts and doesn't help our analysis.
To do so, I have downloaded a file of bad words made available by folks at CMU. [This is the list](https://www.cs.cmu.edu/~biglou/resources/). Not for the faint of heart, though!
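
Adding every top-10 word can also remove words that are distinctive for a single episode. One possible refinement, sketched below, is to add only the words that show up in the top list of several episodes. The contents of `top_dict_example` and the `min_episodes` threshold are invented for illustration and are not taken from the transcripts.

```python
from collections import Counter

# Toy stand-in for top_dict: episode -> list of (word, count) pairs.
# These words and counts are invented for illustration.
top_dict_example = {
    "S01E01": [("like", 40), ("just", 35), ("pied", 12)],
    "S01E02": [("like", 38), ("know", 30), ("just", 22)],
    "S01E03": [("just", 41), ("like", 33), ("money", 15)],
}

def common_top_words(top_dict, min_episodes):
    """Return words that appear in the top list of at least min_episodes episodes."""
    # Each word occurs at most once per episode, so the counter counts episodes
    counts = Counter(word for pairs in top_dict.values() for word, _ in pairs)
    return sorted(word for word, n in counts.items() if n >= min_episodes)

extra_stop_words = common_top_words(top_dict_example, min_episodes=2)
print(extra_stop_words)
```

With the real data, `common_top_words(top_dict, min_episodes)` could replace the step that appends every top word, at the cost of choosing a threshold.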

In [None]:
from sklearn.feature_extraction import text 

stop_words = text.ENGLISH_STOP_WORDS

In [None]:
# Copy the frozenset of stop words into a list so that more words can be appended
words = list(stop_words)
    
len(words)

In [None]:
# Add the most common words to the above list of 318 stop words
for episode in episodes:
    for t in top_dict[episode]:
        words.append(t[0])
        
len(words)

In [None]:
# Reading the bad-words file

bad_words = pd.read_csv(r'E:\Silicon Valley\bad-words.txt')
type(bad_words)

In [None]:
# The column is named 'abbo' because read_csv used the first line of the file as the header
for item in bad_words.abbo:
    words.append(item)
    
len(words)
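
Because `words` is built by repeated `append` calls, the same word can be added more than once, for example when a top word recurs across episodes. Duplicates should not change what gets filtered, but they inflate the `len(words)` counts. A minimal sketch of an order-preserving de-duplication, using a small invented list in place of the real one:

```python
def dedupe_stop_words(candidates):
    """Lowercase, strip and de-duplicate words while keeping first-seen order."""
    cleaned = (str(w).strip().lower() for w in candidates)
    # dict.fromkeys keeps insertion order and drops repeats
    return list(dict.fromkeys(w for w in cleaned if w))

# Invented sample standing in for the combined stop-word list
sample_words = ["the", "like", "Like", "just", "the", " know "]
unique_words = dedupe_stop_words(sample_words)
print(unique_words)
```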

In [None]:
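# Now we have 1780 words which need to be removed from our DTM
# Recreate document-term matrix

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words=words)
data_cv = cv.fit_transform(data.transcripts)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())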
data_stop.index = data.index
data_stop

In [None]:
# Importing dependencies and creating word clouds for each episode

import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud

fig = plt.figure(figsize=(15, 5))

for index, episode in enumerate(episodes):
    # Build the cloud from the transcript, skipping the stop words collected above
    wordcloud = WordCloud(stopwords=words, max_font_size=50, background_color='white').generate(data.transcripts[episode])

    # Place each cloud on a 2 x 4 grid (assumes at most 8 episodes)
    plt.subplot(2, 4, index + 1)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(episode)

plt.show()


We see that some of the themes around which the conversation revolves are company, money, world, obsessed, vision, big, etc.
These are words that ambitious people trying to build something might use, especially in the tech industry. This is one of the reasons I have been attracted to the show, as I personally like to talk about these things too.


Next up, let's conduct a Sentiment Analysis on the tweets about the show and see what we can find.

## 3. Sentiment Analysis

To perform Sentiment Analysis, we will use the tweepy module to collect tweets and textblob to analyse them.

[Tweepy Documentation](https://tweepy.readthedocs.io/en/latest/)<br>
[TextBlob Documentation](https://textblob.readthedocs.io/en/dev/)

In [None]:
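import os

import tweepy
from textblob import TextBlob

# Read the Twitter API credentials from environment variables instead of hardcoding them.
# The variable names below are illustrative; use whatever names you have configured.
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']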


In [None]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)



In [None]:
# Searching tweets containing 'Silicon Valley HBO'
# Note: Tweepy 4.0 and later rename api.search to api.search_tweets

public_tweets = api.search('Silicon Valley HBO', lang = 'en' , count = 100)
print(public_tweets)

In [None]:
#Getting the polarity and subjectivity for each tweet

for tweet in public_tweets:
    print(tweet.text)
    analysis = TextBlob(tweet.text)
    print(analysis.sentiment)


In [None]:
#Creating a dataframe of tweets

tweet_data = pd.DataFrame(data=[tweet.text for tweet in public_tweets], columns=['Tweets'])

display(tweet_data.sample(10))
tweet_data.shape

Another approach I have used is **Vader Sentiment Analysis**, which gives us the polarity in terms of positive, negative and neutral scores, plus an overall compound sentiment. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

In [None]:


from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

# Score each tweet; polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys.
# The list is called scores rather than list so that it does not shadow the built-in.
scores = []

for index, row in tweet_data.iterrows():
    score = analyser.polarity_scores(row['Tweets'])
    scores.append(score)

tweet_data['polarity'] = pd.Series(scores).values

display(tweet_data.head(10))

We can see that there is barely any negative sentiment around the show in this sample, and most tweets are either neutral or positive. This might explain why the show has been very popular.
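
One way to make this observation more concrete is to label each tweet from its compound score and count the labels. The sketch below uses the ±0.05 cut-offs suggested in the VADER documentation. The score dictionaries are invented examples in the shape returned by `polarity_scores`, not results from the collected tweets, and `label_from_compound` is a hypothetical helper name.

```python
from collections import Counter

def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Invented score dicts in the shape produced by polarity_scores
example_scores = [
    {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.64},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.3, "neu": 0.7, "pos": 0.0, "compound": -0.42},
    {"neg": 0.0, "neu": 0.8, "pos": 0.2, "compound": 0.31},
]

# Count how many examples fall under each label
label_counts = Counter(label_from_compound(s["compound"]) for s in example_scores)
print(label_counts)
```

On the real data, the same helper could be applied to the `polarity` column, for instance with `tweet_data['polarity'].apply(lambda s: label_from_compound(s['compound']))`, followed by `value_counts()`.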

In [None]:
# Saving the above dataframe for future use
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
tweet_data.to_csv(r'E:\Silicon Valley\Silicon_Valley_VaderSentiment.csv')

So far, I have tried to answer the first two of the three questions I set out to answer. For the last one, I will repeat the same steps for TBBT and make comparisons. This will be carried out in a separate notebook.


Cheers, Always Blue! Always Blue! Always Blue! Always Blue!