# 1. Data Collection and Data Cleaning

## 1.1 Data Collection

Our data is the transcripts of the show. The team at https://www.springfieldspringfield.co.uk/ has done a wonderful job of transcribing every episode. We will use all the transcripts for Season 1 (8 episodes, so 8 transcripts).

To collect the data, we will do some web scraping using Beautiful Soup, a Python package for parsing HTML and XML documents.

[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [1]:
#Importing packages

import requests
from bs4 import BeautifulSoup
import pickle
from requests import get

In [2]:
#Creating a list of links to scrape

urls = []
for i in range(1,9):
    url = 'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e0'+str(i)
    urls.append(url)
    
urls

['https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e01',
 'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e02',
 'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e03',
 'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e04',
 'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e05',
 'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e06',
 'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e07',
 'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e08']
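
Appending `'0' + str(i)` works here because Season 1 has fewer than ten episodes. As a small sketch (the names `BASE_URL` and `episode_url` are made up for illustration), zero-padding the episode number with a format specifier builds the same links and would also cope with two-digit episode numbers:

```python
# Hypothetical helper; the URL pattern is the one used in the cell above.
BASE_URL = ('https://www.springfieldspringfield.co.uk/view_episode_scripts.php'
            '?tv-show=silicon-valley-2014&episode=')

def episode_url(season, episode):
    # :02d pads single digits with a leading zero, e.g. 1 -> "01"
    return f'{BASE_URL}s{season:02d}e{episode:02d}'

urls = [episode_url(1, i) for i in range(1, 9)]
```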

In [3]:
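# Perform scraping

def url_to_transcript(url):
    """Download one episode page and return its transcript text in a one-item list."""
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")  # the "lxml" parser requires the lxml package
    # The transcript lives in <div class="scrolling-script-container">
    text = [soup.find('div', class_='scrolling-script-container').text]
    print(url)  # progress indicator
    return text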
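
The function above assumes that every request succeeds and that the page always contains the transcript `div`. A more defensive version might look like the sketch below. The helper names `extract_transcript` and `fetch_transcript`, the 10-second timeout, and the choice of Python's built-in `html.parser` are assumptions for illustration, not part of the original notebook. Note that this sketch returns a plain string rather than a one-item list.

```python
import requests
from bs4 import BeautifulSoup

def extract_transcript(html):
    """Return the transcript text from a page's HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', class_='scrolling-script-container')
    if container is None:
        # find() returns None when no matching tag exists
        raise ValueError('transcript container not found')
    return container.get_text()

def fetch_transcript(url):
    """Download one page and extract its transcript (hypothetical helper)."""
    response = requests.get(url, timeout=10)  # assumed timeout, in seconds
    response.raise_for_status()               # raise on 4xx/5xx responses
    return extract_transcript(response.text)

# Offline check on a hand-written snippet instead of a live page
sample_html = ('<html><body><div class="scrolling-script-container">'
               ' Whoo. Yeah. </div></body></html>')
sample_text = extract_transcript(sample_html)
```

Separating the parsing from the download makes it possible to check the parsing step without touching the network.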

In [4]:
# Storing all the transcripts in a list. This is the raw data we have to work with. Very unclean. Sigh!

transcripts = [url_to_transcript(u) for u in urls]
transcripts

https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e01
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e02
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e03
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e04
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e05
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e06
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e07
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=silicon-valley-2014&episode=s01e08


[['\r\n\r\n\r\n                    \t\t\t Whoo. Yeah. Somebody make some motherfucking noise in here! Fuck these people. Man, this place is unbelievable. Fucking Goolybib, man. Those guys build a mediocre piece of software, that might be worth something someday, and now they live here. There\'s money flying all over Silicon Valley but none of it ever seems to hit us. What the hell are you eating? Liquid shrimp. It\'s 200 dollars a quart. Wylie Dufresne made it. How does it taste? Like how I would imagine cum tastes. You guys taking it all in? Because this is what it looks like when Google acquires your company for over 200 million dollars. Look Dustin Moskovitz. Elon Musk. Eric Schmidt. Whatever the fuck the guy\'s name is who created Photrio. I mean, Kid Rock is the poorest person here. Apart from you guys. Ok, there\'s 40 billion dollars of net worth, walking around this party. And you guys are standing around drinking shrimp and talking about what cum tastes like. Yeah, I heard that

In [5]:
#Creating a list of episodes for easy reference

episodes  = []
for k in range(1,9):
    ep = 'e'+str(k)
    episodes.append(ep)
    
episodes

['e1', 'e2', 'e3', 'e4', 'e5', 'e6', 'e7', 'e8']

In [6]:
#Creating a dictionary for easy readability (episode:transcript)

data_dict = {k:v for k,v in zip(episodes,transcripts)}
data_dict

{'e1': ['\r\n\r\n\r\n                    \t\t\t Whoo. Yeah. Somebody make some motherfucking noise in here! Fuck these people. Man, this place is unbelievable. Fucking Goolybib, man. Those guys build a mediocre piece of software, that might be worth something someday, and now they live here. There\'s money flying all over Silicon Valley but none of it ever seems to hit us. What the hell are you eating? Liquid shrimp. It\'s 200 dollars a quart. Wylie Dufresne made it. How does it taste? Like how I would imagine cum tastes. You guys taking it all in? Because this is what it looks like when Google acquires your company for over 200 million dollars. Look Dustin Moskovitz. Elon Musk. Eric Schmidt. Whatever the fuck the guy\'s name is who created Photrio. I mean, Kid Rock is the poorest person here. Apart from you guys. Ok, there\'s 40 billion dollars of net worth, walking around this party. And you guys are standing around drinking shrimp and talking about what cum tastes like. Yeah, I hear
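
`pickle` is imported above but not used in this part of the notebook. One plausible use, sketched here as an assumption, is to cache the scraped transcripts on disk so that later runs do not have to hit the website again. The file name `transcripts.pkl` and the stand-in dictionary are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for data_dict; the real values are the full scraped transcripts
raw = {'e1': ['first transcript'], 'e2': ['second transcript']}

# Hypothetical cache location in the system's temporary directory
cache_path = os.path.join(tempfile.gettempdir(), 'transcripts.pkl')

# Save the dictionary to disk
with open(cache_path, 'wb') as f:
    pickle.dump(raw, f)

# Load it back in a later session
with open(cache_path, 'rb') as f:
    restored = pickle.load(f)
```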

## 1.2 Data Cleaning

Before cleaning the data, we will store it in a DataFrame, since it is easier to manipulate in that format. We'll use the pandas library for this purpose.<br>
[pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)

For string manipulation we will use the re (regular expression) and string libraries.<br>
[re Documentation](https://docs.python.org/3/library/re.html)<br>
[String Documentation](https://docs.python.org/3/library/string.html)

In [7]:
#Importing packages

import pandas as pd
import re
import string

In [8]:
#Creating a dataframe

df = pd.DataFrame.from_dict(data_dict).transpose()
df.columns = ['transcripts']
df

| | transcripts |
|---|---|
| e1 | \r\n\r\n\r\n \t\t\t Whoo. Y... |
| e2 | \r\n\r\n\r\n \t\t\t Holy sh... |
| e3 | \r\n\r\n\r\n \t\t\t The gre... |
| e4 | \r\n\r\n\r\n \t\t\t Richie!... |
| e5 | \r\n\r\n\r\n \t\t\t It's a ... |
| e6 | \r\n\r\n\r\n \t\t\t Kidney ... |
| e7 | \r\n\r\n\r\n \t\t\t The cor... |
| e8 | \r\n\r\n\r\n \t\t\t I'll ri... |
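
`from_dict(...).transpose()` first builds a frame with one column per episode and then flips it. As a sketch of an equivalent route, `orient='index'` asks pandas to treat the dictionary keys as row labels directly. The two-episode dictionary below is a made-up stand-in for `data_dict`.

```python
import pandas as pd

# Hypothetical stand-in for data_dict (episode -> one-item list)
sample_dict = {'e1': ['first transcript'], 'e2': ['second transcript']}

# Approach used in the notebook: keys become columns, then transpose
df_a = pd.DataFrame.from_dict(sample_dict).transpose()
df_a.columns = ['transcripts']

# Alternative: keys become the index in one step
df_b = pd.DataFrame.from_dict(sample_dict, orient='index', columns=['transcripts'])
```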


In [9]:
#Quick check of the dataframe to see if things look right

df.loc['e1', 'transcripts']

'\r\n\r\n\r\n                    \t\t\t Whoo. Yeah. Somebody make some motherfucking noise in here! Fuck these people. Man, this place is unbelievable. Fucking Goolybib, man. Those guys build a mediocre piece of software, that might be worth something someday, and now they live here. There\'s money flying all over Silicon Valley but none of it ever seems to hit us. What the hell are you eating? Liquid shrimp. It\'s 200 dollars a quart. Wylie Dufresne made it. How does it taste? Like how I would imagine cum tastes. You guys taking it all in? Because this is what it looks like when Google acquires your company for over 200 million dollars. Look Dustin Moskovitz. Elon Musk. Eric Schmidt. Whatever the fuck the guy\'s name is who created Photrio. I mean, Kid Rock is the poorest person here. Apart from you guys. Ok, there\'s 40 billion dollars of net worth, walking around this party. And you guys are standing around drinking shrimp and talking about what cum tastes like. Yeah, I heard that. 

Data cleaning can be a long process. As a first pass, I will perform the following steps to clean the data:

1. Convert all text to lowercase
2. Collapse runs of whitespace characters into a single space
3. Remove punctuation marks
4. Remove terms containing numbers

In [10]:
# Making a function to perform the above-mentioned steps

def clean_text_pass1(text):
    text = text.lower()                                               # 1. lowercase
    text = re.sub(r'\s+', ' ', text)                                  # 2. collapse whitespace runs into one space
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # 3. strip punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # 4. drop tokens containing digits
    return text

In [11]:
# Checking if it works well

clean_text_pass1(df.loc['e1', 'transcripts'])

' whoo yeah somebody make some motherfucking noise in here fuck these people man this place is unbelievable fucking goolybib man those guys build a mediocre piece of software that might be worth something someday and now they live here theres money flying all over silicon valley but none of it ever seems to hit us what the hell are you eating liquid shrimp its  dollars a quart wylie dufresne made it how does it taste like how i would imagine cum tastes you guys taking it all in because this is what it looks like when google acquires your company for over  million dollars look dustin moskovitz elon musk eric schmidt whatever the fuck the guys name is who created photrio i mean kid rock is the poorest person here apart from you guys ok theres  billion dollars of net worth walking around this party and you guys are standing around drinking shrimp and talking about what cum tastes like yeah i heard that you guys live in my incubator youve got to network thats why i brought you here i got u
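
The output above contains double spaces (for example `its  dollars`) where a number used to be, because whitespace is collapsed before the numbers are removed. `CountVectorizer` splits on word boundaries by default, so this should not affect the word counts later on, but if tidy text matters, one option is to collapse whitespace once more at the end. `clean_text_pass1_tidy` is a hypothetical name, not a function from the notebook.

```python
import re
import string

def clean_text_pass1(text):
    # Same four steps as the notebook's function
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

def clean_text_pass1_tidy(text):
    """Hypothetical variant: same steps, then collapse leftover whitespace."""
    text = clean_text_pass1(text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "\r\n\t It's 200 dollars a quart. "
original = clean_text_pass1(sample)    # keeps the gap left by "200"
tidy = clean_text_pass1_tidy(sample)   # single spaces, no leading/trailing space
```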

In [12]:
# Looks good, so we'll apply the function to the whole dataset

data = df.transcripts.apply(clean_text_pass1)
data

e1     whoo yeah somebody make some motherfucking no...
e2     holy shit  uh  what the fuck is that uh that ...
e3     the greatness of human accomplishment has alw...
e4     richie  right on time  hey youre the lawyer r...
e5     its a fucking sketchy neighborhood man you se...
e6     kidney function liver function testosterone i...
e7     the core compression algorithm is optimal all...
e8     ill rip your dick off you son of a bitch my e...
Name: transcripts, dtype: object

## 1.2.1 Organizing the data 

For analysis, we need the data arranged in two formats:

1. **Corpus** - a collection of texts
2. **Document-Term Matrix** - word counts in matrix format
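
To make the two formats concrete, here is a minimal sketch that builds a document-term matrix by hand from a made-up two-document corpus, using only the standard library. The documents and variable names are illustrative assumptions.

```python
from collections import Counter

# Corpus: a collection of texts, keyed by document name
corpus = {'d1': 'the cat sat', 'd2': 'the cat saw the dog'}

# Count the words in each document
counts = {doc: Counter(text.split()) for doc, text in corpus.items()}

# Vocabulary: every distinct word across all documents, sorted
vocab = sorted(set().union(*counts.values()))

# Document-term matrix: one row per document, one column per vocabulary word
dtm = {doc: [c[word] for word in vocab] for doc, c in counts.items()}
```

Each row lines up with `vocab`, so the last entry of the `d2` row is the count of `the` in that document.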

### Corpus

In [13]:
#The clean dataframe created above is our corpus

data = pd.DataFrame(data) 
data                                         

| | transcripts |
|---|---|
| e1 | whoo yeah somebody make some motherfucking no... |
| e2 | holy shit uh what the fuck is that uh that ... |
| e3 | the greatness of human accomplishment has alw... |
| e4 | richie right on time hey youre the lawyer r... |
| e5 | its a fucking sketchy neighborhood man you se... |
| e6 | kidney function liver function testosterone i... |
| e7 | the core compression algorithm is optimal all... |
| e8 | ill rip your dick off you son of a bitch my e... |


### Document-Term Matrix

To create the Document-Term Matrix, we will use the CountVectorizer class from scikit-learn. It tokenizes the corpus and builds a matrix in which each row represents a document and each column represents a word. Each value is the number of times that word appears in the corresponding document. Passing `stop_words='english'` excludes common English words from the vocabulary.

[CountVectorizer Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)


In [14]:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data.transcripts)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data.index
data_dtm

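| | abide | ability | able | abound | abrupt | absolutely | absurd | abuse | abuzz | accept | ... | yup | zenella | zero | zeroes | zeros | zimmerman | zips | zone | zones | zuckerberg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| e1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| e2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| e3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| e4 | 0 | 0 | 4 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| e5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| e6 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| e7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 |
| e8 | 0 | 1 | 2 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |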
