# Star Trek TNG - Master Clean Up 
---
## INTRODUCTION

- What is Star Trek? (a short history lesson)

- Goal: Use NLP and sentiment analysis to determine which roster of characters predicts the highest rating on IMDb. 
    - Elaborate on applications in the real world. 

- Data Acquisition (link to GitHub) 

- The goal of this notebook is to clean the dataset and preprocess it for modeling, with some handy visualizations along the way. 



   

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import dataframe_image as dfi

In [2]:
# read in the dataset
startrek_data = pd.read_csv(r'C:\Users\Katya\Documents\data\TNG.csv.gz', encoding= 'latin-1')

In [26]:
# taking a quick peek at the data
startrek_data.shape

(110176, 18)

In [4]:
# filter to only the rows where a character spoke in the episode
startrekspeech = startrek_data.loc[startrek_data['type'] == 'speech']

# take a look
startrekspeech.sample(3)

Unnamed: 0.1,Unnamed: 0,episode,productionnumber,setnames,characters,act,scenenumber,scenedetails,partnumber,type,who,text,speechdescription,Released,Episode,imdbRating,imdbID,Season
971,972,encounter at farpoint,,,,OTHER,203,,972,speech,ZORN,We haven't done anything wrong!,False,1987-09-26,1.0,6.9,tt0094030,1.0
77638,77639,realm of fear,#40276-228,"USS ENTERPRISE,USS ENTERPRISE,MAIN BRIDGE,CAPT...","PICARD,BARCLAY,RIKER,ADMIRAL HAYES,DATA,SCIENC...",FIVE,45,,569,speech,BARCLAY,But... isn't there a way to... to... to force...,False,1992-09-26,2.0,7.2,tt0708761,6.0
36942,36943,a matter of perspective,#40273-162,"USS ENTERPRISE,USS ENTERPRISE,MAIN BRIDGE,CAPT...","PICARD,KRAG,RIKER,APGAR,DATA,MANUA,BEVERLY,TAY...",TWO,38A,,258,speech,PICARD,Not to my knowledge.,False,1990-02-10,14.0,6.8,tt0708671,3.0


In [5]:
StarTrek_Cols = startrekspeech.filter(['episode', 'type', 'who', 'Episode', 'imdbRating', 'Season', 'text'], axis=1)

StarTrek_Cols.head(10)

Unnamed: 0,episode,type,who,Episode,imdbRating,Season,text
1,encounter at farpoint,speech,PICARD V.O.,1.0,6.9,1.0,"Captain's log, stardate 42353.7. Our destinat..."
3,encounter at farpoint,speech,PICARD V.O.,1.0,6.9,1.0,"My orders are to examine Farpoint, a starbase..."
5,encounter at farpoint,speech,PICARD V.O.,1.0,6.9,1.0,"acquainted with my new command, this Galaxy C..."
7,encounter at farpoint,speech,PICARD V.O.,1.0,6.9,1.0,I am still somewhat in awe of its size and co...
9,encounter at farpoint,speech,PICARD V.O.,1.0,6.9,1.0,"several key positions, most notably ..."
12,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"You will agree, Data, that Starfleet's instru..."
13,encounter at farpoint,speech,DATA,1.0,6.9,1.0,Difficult ... how so? Simply solve the myster...
14,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,As simple as that.
15,encounter at farpoint,speech,TROI,1.0,6.9,1.0,Farpoint Station. Even the name sounds myster...
16,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"The problem, Data, is that another life form ..."


In [6]:
# TODO: pie chart of speaker shares before and after cleaning/filtering goes here.
# plt.pie() needs data to plot, e.g. StarTrek_Cols['who'].value_counts().

# inspect the 'who' column to see what needs to be cleaned
StarTrek_Cols['who']

1          PICARD V.O.
3          PICARD V.O.
5          PICARD V.O.
7          PICARD V.O.
9          PICARD V.O.
              ...     
110167            DATA
110168          PICARD
110170          PICARD
110171            TROI
110173          PICARD
Name: who, Length: 68487, dtype: object

# To Do 

Filter the dataset to only have these characters:

- Main Characters:
    - Jean-Luc Picard
    - William Riker
    - Geordi La Forge
    - Tasha Yar
    - Worf
    - Beverly Crusher
    - Deanna Troi
    - Data
    - Wesley Crusher
    - Katherine Pulaski
    - Q

Remove comm voices, voice-overs (V.O.), off-screen lines (O.S.), and every other speaker.
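
The effect of this filter can be previewed on a handful of labels. The sketch below uses made-up speaker labels and a shortened character list (both are assumptions for illustration). It removes spaces from each label and then keeps only exact matches, so a label such as `PICARD V.O.` turns into `PICARDV.O.` and is dropped rather than folded into `PICARD`.

```python
import pandas as pd

# Made-up speaker labels (assumption): voice-over, off-screen, comm voice, guest.
labels = pd.Series(
    ["PICARD", " PICARD", "PICARD V.O.", "RIKER (O.S.)", "DATA'S COM VOICE", "ZORN"]
)

# Shortened character list, for illustration only.
main_cast = ["PICARD", "RIKER", "DATA"]

# Remove every space, then keep only labels that exactly match a character name.
compact = labels.str.replace(" ", "", regex=False)
is_main = compact.isin(main_cast)

kept = compact[is_main]
dropped = compact[~is_main]

print(kept.tolist())
print(dropped.tolist())
```

Exact matching is a deliberate choice here: voice-over and off-screen lines are excluded instead of being attributed to the character who speaks them.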

In [10]:
# remove all spaces from the speaker names (e.g. 'PICARD V.O.' becomes 'PICARDV.O.')
StarTrek_Cols['who'] = StarTrek_Cols['who'].str.replace(' ', '', regex=False)

In [21]:
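# main characters to keep, written the way they appear in the 'who' column
characters = ['PICARD','WORF', 'BEVERLY', 'GEORDI', 'TROI', 'TASHA', 'DATA', 'WESLEY', 'PULASKI', 'Q', 'RIKER']

# keep only rows spoken by a main character; .copy() gives an independent DataFrame
StarTrek_Final = StarTrek_Cols[StarTrek_Cols['who'].isin(characters)].copy()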

In [22]:
StarTrek_Final['who'].unique()

array(['PICARD', 'DATA', 'TROI', 'WORF', 'TASHA', 'RIKER', 'WESLEY',
       'BEVERLY', 'GEORDI', 'PULASKI', 'Q'], dtype=object)

In [25]:
StarTrek_Final.shape

(44460, 7)
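
As a quick sanity check on how much data each filtering step keeps, the row counts reported in the outputs above can be compared directly. This is a small arithmetic sketch. The variable names are illustrative, and the three counts are copied from the `shape` and `Length` outputs shown in this notebook.

```python
# Row counts copied from the notebook outputs above.
total_rows = 110176      # startrek_data.shape[0]
speech_rows = 68487      # length of StarTrek_Cols['who'] (rows of type 'speech')
main_cast_rows = 44460   # StarTrek_Final.shape[0]

# Fraction of rows kept by each filter.
speech_share = speech_rows / total_rows
main_cast_share = main_cast_rows / speech_rows

print(f"speech rows: {speech_share:.1%} of all rows")
print(f"main-character rows: {main_cast_share:.1%} of speech rows")
```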

In [23]:
StarTrek_Cols.loc[StarTrek_Cols['who'].isin(characters)]

Unnamed: 0,episode,type,who,Episode,imdbRating,Season,text
12,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"You will agree, Data, that Starfleet's instru..."
13,encounter at farpoint,speech,DATA,1.0,6.9,1.0,Difficult ... how so? Simply solve the myster...
14,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,As simple as that.
15,encounter at farpoint,speech,TROI,1.0,6.9,1.0,Farpoint Station. Even the name sounds myster...
16,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"The problem, Data, is that another life form ..."
...,...,...,...,...,...,...,...
110167,all good things...,speech,DATA,25.0,8.5,7.0,Would you care to deal?
110168,all good things...,speech,PICARD,25.0,8.5,7.0,Oh... thank you.
110170,all good things...,speech,PICARD,25.0,8.5,7.0,I should have done this a long time ago. I wa...
110171,all good things...,speech,TROI,25.0,8.5,7.0,You were always welcome.


In [13]:
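# spot check: filter on a two-character subset (separate name so the full 'characters' list is not overwritten)
subset = ['PICARD','WORF']
StarTrek_Cols[StarTrek_Cols['who'].isin(subset)]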

Unnamed: 0,episode,type,who,Episode,imdbRating,Season,text
12,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"You will agree, Data, that Starfleet's instru..."
14,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,As simple as that.
16,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"The problem, Data, is that another life form ..."
18,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"Data, how can you be programmed as a virtual ..."
21,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"It means 'to spy, to sneak' ..."
...,...,...,...,...,...,...,...
110129,all good things...,speech,PICARD,25.0,8.5,7.0,"No, no... in fact, I think I'll go back to be..."
110163,all good things...,speech,PICARD,25.0,8.5,7.0,No. I just thought I might... join you this e...
110168,all good things...,speech,PICARD,25.0,8.5,7.0,Oh... thank you.
110170,all good things...,speech,PICARD,25.0,8.5,7.0,I should have done this a long time ago. I wa...


In [None]:
StarTrek_Cols.info()