# Star Trek TNG - Master Clean Up 
---
## INTRODUCTION

- What is Star Trek? (a short history lesson)

- Goal: Use NLP and sentiment analysis to determine which roster of characters predicts the highest IMDb rating.
    - Elaborate on applications in the real world. 

- Data Acquisition (link to GitHub)

- The goal of this notebook is to clean the dataset and preprocess it for modeling, with some handy visualizations along the way.



   

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import dataframe_image as dfi

In [2]:
# read in the dataset
startrek_data = pd.read_csv(r'C:\Users\Katya\Documents\data\TNG.csv.gz', encoding='latin-1')

In [3]:
# taking a quick peek at the data
startrek_data.shape

(110176, 18)

In [4]:
startrek_data.head()

Unnamed: 0.1,Unnamed: 0,episode,productionnumber,setnames,characters,act,scenenumber,scenedetails,partnumber,type,who,text,speechdescription,Released,Episode,imdbRating,imdbID,Season
0,1,encounter at farpoint,,,,TEASER,1,,1,description,,The U.S.S. Enterprise NCC 1701-D traveling at...,False,1987-09-26,1.0,6.9,tt0094030,1.0
1,2,encounter at farpoint,,,,TEASER,1,,2,speech,PICARD V.O.,"Captain's log, stardate 42353.7. Our destinat...",False,1987-09-26,1.0,6.9,tt0094030,1.0
2,3,encounter at farpoint,,,,TEASER,2,,3,description,,on the gigantic new Enterprise NCC 1701-D.,False,1987-09-26,1.0,6.9,tt0094030,1.0
3,4,encounter at farpoint,,,,TEASER,2,,4,speech,PICARD V.O.,"My orders are to examine Farpoint, a starbase...",False,1987-09-26,1.0,6.9,tt0094030,1.0
4,5,encounter at farpoint,,,,TEASER,3,,5,description,,"Huge, with a giant wall diagram showing the i...",False,1987-09-26,1.0,6.9,tt0094030,1.0


In [5]:
# filter to only the rows where a character speaks in the episode
startrekspeech = startrek_data.loc[startrek_data['type'] == 'speech']

# take a look
startrekspeech.head()

Unnamed: 0.1,Unnamed: 0,episode,productionnumber,setnames,characters,act,scenenumber,scenedetails,partnumber,type,who,text,speechdescription,Released,Episode,imdbRating,imdbID,Season
1,2,encounter at farpoint,,,,TEASER,1,,2,speech,PICARD V.O.,"Captain's log, stardate 42353.7. Our destinat...",False,1987-09-26,1.0,6.9,tt0094030,1.0
3,4,encounter at farpoint,,,,TEASER,2,,4,speech,PICARD V.O.,"My orders are to examine Farpoint, a starbase...",False,1987-09-26,1.0,6.9,tt0094030,1.0
5,6,encounter at farpoint,,,,TEASER,3,,6,speech,PICARD V.O.,"acquainted with my new command, this Galaxy C...",True,1987-09-26,1.0,6.9,tt0094030,1.0
7,8,encounter at farpoint,,,,TEASER,4,,8,speech,PICARD V.O.,I am still somewhat in awe of its size and co...,False,1987-09-26,1.0,6.9,tt0094030,1.0
9,10,encounter at farpoint,,,,TEASER,5,,10,speech,PICARD V.O.,"several key positions, most notably ...",True,1987-09-26,1.0,6.9,tt0094030,1.0


In [6]:
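# keep only the columns needed downstream; .copy() makes the later column assignments act on an independent frame
StarTrek_Cols = startrekspeech.filter(['episode', 'type', 'who', 'Episode', 'imdbRating', 'Season', 'text'], axis=1).copy()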


In [7]:
# remove leading whitespaces
StarTrek_Cols['who'] = StarTrek_Cols['who'].str.lstrip()

In [8]:
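# intended to remove the V.O. suffix from PICARD's lines
# note: as written, the pattern deletes 'PICARD' plus the token after it, so 'PICARD V.O.' becomes ''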
StarTrek_Cols['who'] = StarTrek_Cols['who'].str.replace(r'PICARD\s\S*', '', regex=True)
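
A narrower pattern could strip only a trailing voice-over or off-screen marker and leave the character name in place. The following is a minimal sketch of that idea, not the notebook's method; the `who` series is a hypothetical sample modelled on labels seen in the dataset, and the `suffix` pattern is an assumption about how the markers are written.

```python
import pandas as pd

# Hypothetical sample of raw speaker labels, modelled on values in the 'who' column
who = pd.Series(['PICARD V.O.', 'RIKER (V.O.)', 'DATA (O.S.)',
                 'PICARD', "PICARD'S INTERCOM VOICE"])

# Match only a trailing V.O. / O.S. marker, with or without parentheses
suffix = r'\s*\(?\b(?:V\.O\.?|O\.S\.?)\)?\s*$'

# Remove the marker and any leftover surrounding whitespace
cleaned = who.str.replace(suffix, '', regex=True).str.strip()
print(cleaned.tolist())
```

Because the pattern is anchored to the end of the string and requires the literal dots, labels such as `PICARD'S INTERCOM VOICE` pass through unchanged.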

In [9]:
# StarTrek_Cols['who']= StarTrek_Cols['who'].apply(lambda x: x.split('PICARD')[0])

In [10]:
# # remove whitespaces
# StarTrek_Cols['who'] = StarTrek_Cols['who'].str.replace(' ', '')

In [11]:
# checking whether the regex worked
StarTrek_Cols['who'].unique().tolist()

['',
 'PICARD',
 'DATA',
 'TROI',
 'WORF',
 'CONN',
 '"Q" (ELIZABETHAN)',
 'MEDIC',
 '"Q" (MARINE CAPTAIN)',
 ' (MARINE CAPTAIN)  (MARINE CAPTAIN)',
 '"Q" (21ST CENTURY)',
 'TASHA',
 "PICARD'S INTERCOM VOICE",
 'MANDARIN-BAILIFF',
 'FUTURE MILITARY OFFICER',
 'MILITARY OFFICER',
 '"Q" (JUDGE)',
 '"Q" (JUDGE) MANDARIN-BAILIFF',
 'DATA "Q" (JUDGE)',
 'OPS',
 "RIKER'S VOICE",
 'ZORN',
 'RIKER',
 "ZORN ZORN (Cont'd)",
 'WESLEY',
 'BEVERLY',
 'MARKHAM',
 'GEORDI/MARKHAM',
 'GEORDI',
 'BANDI WOMAN',
 "DATA'S VOICE",
 'RIKER (V.O.)',
 'ADMIRAL',
 "PICARD'S VOICE",
 "TROI'S VOICE",
 'YOUNG ENSIGN',
 'COMPUTER VOICE',
 'RIKER COMPUTER VOICE',
 'RIKER DATA',
 "WESLEY'S VOICE",
 'BEVERLY WESLEY',
 "TASHA'S VOICE",
 'WES',
 'SECURITY VOICE',
 "ZORN'S VOICE",
 'SECURITY POSITION',
 'OPERATIONS POSITION',
 "ZORN'S VOICE PICARD",
 'INTERCOM VOICE',
 '"Q" (STARFLEET)',
 "WOMAN'S COM VOICE",
 'TRANSPORTER CHIEF',
 "TASHA'S COM VOICE",
 "RIKER'S COM VOICE",
 "BEVERLY'S COM VOICE",
 ' O.)',
 "TROI'S COM 

# To Do 

Filter the dataset to only have these characters:

- Main Characters:
    - Jean-Luc Picard
    - William Riker
    - Geordi La Forge
    - Tasha Yar
    - Worf
    - Beverly Crusher
    - Deanna Troi
    - Data
    - Wesley Crusher
    - Katherine Pulaski
    - Q

Remove comm voices, voice-overs (V.O.), off-screen markers (O.S.), and everything else not in the list above.
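
One way to read this To Do is to strip the qualifiers so that a line such as `RIKER'S COM VOICE` is attributed to `RIKER` before filtering. The sketch below assumes that reading; if those lines should be dropped instead, filter on the raw labels first. `normalize_speaker` and `ALIASES` are hypothetical names introduced for illustration, and the rules cover only the label shapes visible in the unique values printed earlier.

```python
import re

# Assumed alias table: script shorthand mapped to the label used for filtering
ALIASES = {'WES': 'WESLEY'}

def normalize_speaker(raw):
    """Reduce a raw script label to a bare character name (illustrative rules only)."""
    name = raw.strip().upper()
    # drop parenthesised qualifiers such as (V.O.), (O.S.), (JUDGE)
    name = re.sub(r'\s*\([^)]*\)', '', name)
    # drop bare trailing voice-over / off-screen markers
    name = re.sub(r'\s+(?:V\.O\.?|O\.S\.?)$', '', name)
    # drop possessive voice labels: "RIKER'S COM VOICE" -> "RIKER"
    name = re.sub(r"'S\s+(?:\w+\s+)*VOICE$", '', name)
    # drop quotation marks around names such as "Q"
    name = name.replace('"', '').strip()
    return ALIASES.get(name, name)

for label in ['PICARD V.O.', '"Q" (JUDGE)', "RIKER'S COM VOICE", 'WES', 'COMPUTER VOICE']:
    print(repr(label), '->', repr(normalize_speaker(label)))
```

Applied to the frame, this would look like `StarTrek_Cols['who'].dropna().map(normalize_speaker)`. Merged labels such as `RIKER DATA` are deliberately left as they are, since it is not clear from the label alone who is speaking.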

In [11]:
# list of characters to keep (test run); note that LORE is included in addition to the To Do list
characters = ['PICARD','WORF', 'BEVERLY', 'GEORDI', 'TROI', 'TASHA', 'DATA', 'WESLEY', 'PULASKI', 'Q', 'RIKER', 'LORE']

# filtering to only the characters in the list above
StarTrek_Final = StarTrek_Cols[StarTrek_Cols['who'].isin(characters)].copy()

In [12]:
# checking to see if the values are correct
StarTrek_Final['who'].unique()

array(['PICARD', 'DATA', 'TROI', 'WORF', 'TASHA', 'RIKER', 'WESLEY',
       'BEVERLY', 'GEORDI', 'LORE', 'PULASKI', 'Q'], dtype=object)

In [13]:
StarTrek_Final.shape

(44600, 7)
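
Since `isin` matches exact labels only, variants such as `RIKER (V.O.)` or `PICARD'S VOICE` are excluded by the filter. A quick check like the one below can show how many lines each character keeps and which dropped labels still mention a main character. It is a sketch on a hypothetical sample of labels, not on the real frame.

```python
import pandas as pd

characters = ['PICARD', 'WORF', 'BEVERLY', 'GEORDI', 'TROI', 'TASHA',
              'DATA', 'WESLEY', 'PULASKI', 'Q', 'RIKER', 'LORE']

# Hypothetical sample of labels modelled on the unique values printed earlier
who = pd.Series(['PICARD', "PICARD'S VOICE", 'RIKER (V.O.)', 'ZORN',
                 'DATA', 'RIKER DATA', 'PICARD'])

kept = who[who.isin(characters)]
dropped = who[~who.isin(characters)]

# Lines retained per character
print(kept.value_counts())

# Dropped labels that still contain a main character's name as a whole word
pattern = r'\b(?:' + '|'.join(characters) + r')\b'
near_misses = dropped[dropped.str.contains(pattern, regex=True)]
print(near_misses.tolist())
```

On the real data, `who` would be `StarTrek_Cols['who'].dropna()`; the near misses are candidates for cleanup before the filter is applied.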

In [14]:
StarTrek_Cols.loc[StarTrek_Cols['who'].isin(characters)]

Unnamed: 0,episode,type,who,Episode,imdbRating,Season,text
12,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"You will agree, Data, that Starfleet's instru..."
13,encounter at farpoint,speech,DATA,1.0,6.9,1.0,Difficult ... how so? Simply solve the myster...
14,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,As simple as that.
15,encounter at farpoint,speech,TROI,1.0,6.9,1.0,Farpoint Station. Even the name sounds myster...
16,encounter at farpoint,speech,PICARD,1.0,6.9,1.0,"The problem, Data, is that another life form ..."
...,...,...,...,...,...,...,...
110167,all good things...,speech,DATA,25.0,8.5,7.0,Would you care to deal?
110168,all good things...,speech,PICARD,25.0,8.5,7.0,Oh... thank you.
110170,all good things...,speech,PICARD,25.0,8.5,7.0,I should have done this a long time ago. I wa...
110171,all good things...,speech,TROI,25.0,8.5,7.0,You were always welcome.


In [15]:
StarTrek_Cols['episode'].unique().shape

(174,)

In [16]:
StarTrek_Final['who'].unique().tolist()

['PICARD',
 'DATA',
 'TROI',
 'WORF',
 'TASHA',
 'RIKER',
 'WESLEY',
 'BEVERLY',
 'GEORDI',
 'LORE',
 'PULASKI',
 'Q']

In [17]:
StarTrek_Cols.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68487 entries, 1 to 110173
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   episode     68487 non-null  object 
 1   type        68487 non-null  object 
 2   who         68486 non-null  object 
 3   Episode     67739 non-null  float64
 4   imdbRating  67739 non-null  float64
 5   Season      67739 non-null  float64
 6   text        68487 non-null  object 
dtypes: float64(3), object(4)
memory usage: 4.2+ MB
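
The summary above shows 67739 non-null values in `Episode`, `imdbRating`, and `Season` against 68487 rows, so 748 speech rows have no rating attached, and one row has no speaker. A minimal sketch of how those rows could be counted and dropped follows; the `sample` frame is hypothetical stand-in data, and dropping (rather than imputing) is an assumption about what the modeling step needs.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of StarTrek_Cols with placeholder text values
sample = pd.DataFrame({
    'episode': ['episode a', 'episode a', 'episode b', 'episode c'],
    'who': ['PICARD', None, 'DATA', 'TROI'],
    'imdbRating': [6.9, 6.9, np.nan, 8.5],
    'text': ['line 1', 'line 2', 'line 3', 'line 4'],
})

# Count missing values per column
missing = sample.isna().sum()
print(missing)

# Keep only rows that have both a speaker and a rating to model against
complete = sample.dropna(subset=['who', 'imdbRating'])
print(complete.shape)
```

Before dropping anything on the real data, it may be worth listing `StarTrek_Cols.loc[StarTrek_Cols['imdbRating'].isna(), 'episode'].unique()` to see which episodes failed to pick up a rating.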
