# Star Trek TNG - Master Clean Up 
---
## INTRODUCTION

- What is Star Trek? (a short history lesson)

- Goal: Use NLP and sentiment analysis to determine which roster of characters predicts the highest IMDb rating.
    - Elaborate on applications in the real world. 

- Data Acquisition (link to GitHub)

- Goal of this notebook: clean the dataset and preprocess it for modeling, with some handy visualizations along the way.



   

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import dataframe_image as dfi

In [2]:
# read in the dataset
startrek_data = pd.read_csv(r'C:\Users\Katya\Documents\data\TNG.csv.gz', encoding= 'latin-1')

In [3]:
# taking a quick peek at the data
startrek_data.head(5)

Unnamed: 0.1,Unnamed: 0,episode,productionnumber,setnames,characters,act,scenenumber,scenedetails,partnumber,type,who,text,speechdescription,Released,Episode,imdbRating,imdbID,Season
0,1,encounter at farpoint,,,,TEASER,1,,1,description,,The U.S.S. Enterprise NCC 1701-D traveling at...,False,1987-09-26,1.0,6.9,tt0094030,1.0
1,2,encounter at farpoint,,,,TEASER,1,,2,speech,PICARD V.O.,"Captain's log, stardate 42353.7. Our destinat...",False,1987-09-26,1.0,6.9,tt0094030,1.0
2,3,encounter at farpoint,,,,TEASER,2,,3,description,,on the gigantic new Enterprise NCC 1701-D.,False,1987-09-26,1.0,6.9,tt0094030,1.0
3,4,encounter at farpoint,,,,TEASER,2,,4,speech,PICARD V.O.,"My orders are to examine Farpoint, a starbase...",False,1987-09-26,1.0,6.9,tt0094030,1.0
4,5,encounter at farpoint,,,,TEASER,3,,5,description,,"Huge, with a giant wall diagram showing the i...",False,1987-09-26,1.0,6.9,tt0094030,1.0


In [4]:
# filter to only the rows where a character speaks in the episode
startrekspeech = startrek_data.loc[startrek_data['type'] == 'speech']

# take a look
startrekspeech.sample(3)

Unnamed: 0.1,Unnamed: 0,episode,productionnumber,setnames,characters,act,scenenumber,scenedetails,partnumber,type,who,text,speechdescription,Released,Episode,imdbRating,imdbID,Season
14443,14444,conspiracy,#40271-125,"USS ENTERPRISE,USS ENTERPRISE,MAIN BRIDGE,CAPT...","PICARD,CAPTAIN WALKER KEEL,RIKER,TRYLA SCOTT,B...",TEASER,7,,48,speech,KEEL,"Hello, Jean-Luc. Been a long time.",False,1988-05-07,24.0,8.1,tt0708691,1.0
83590,83591,ship in a bottle,#40276-238,"USS ENTERPRISE,USS ENTERPRISE,MAIN BRIDGE,TRAN...","PICARD,BARCLAY,RIKER,MORIARTY,DATA,COUNTESS,BE...",TWO,9,,156,speech,DATA,Data to Security... send two officers to Holo...,True,1993-01-23,12.0,8.5,tt0708773,6.0
83658,83659,ship in a bottle,#40276-238,"USS ENTERPRISE,USS ENTERPRISE,MAIN BRIDGE,TRAN...","PICARD,BARCLAY,RIKER,MORIARTY,DATA,COUNTESS,BE...",TWO,12,,224,speech,BARCLAY,"Even if we decided to do it, there's no guara...",False,1993-01-23,12.0,8.5,tt0708773,6.0


In [5]:
# keep only the columns needed for the analysis
StarTrek_Cols = startrekspeech.filter(['episode', 'type', 'who', 'Episode', 'imdbRating', 'Season', 'text'], axis=1)

StarTrek_Cols.sample(10)

Unnamed: 0,episode,type,who,Episode,imdbRating,Season,text
25748,up the long ladder,speech,PULASKI,18.0,6.3,2.0,Out.
14758,conspiracy,speech,SAVAR,24.0,8.1,1.0,"Welcome home, Captain Picard."
96853,phantasms,speech,PICARD,6.0,7.6,7.0,Perhaps Starbase Eighty-four... could we have...
32708,the enemy,speech,BOCHRA,7.0,7.8,3.0,A Romulan ship will arrive shortly... you wil...
108845,all good things...,speech,PICARD,25.0,8.5,7.0,Twenty-five years... Time's been good to you.
48182,reunion,speech,K'MPEC,7.0,8.3,4.0,"No. By tradition, the two strongest challenge..."
2573,haven,speech,RIKER,10.0,6.2,1.0,I thought the Tarellians were finished! What ...
62333,disaster,speech,PICARD,5.0,7.8,5.0,Feel around the edge of the illumination modu...
108854,all good things...,speech,PICARD,25.0,8.5,7.0,How is Leah?
16631,where silence has lease,speech,WESLEY,2.0,7.1,2.0,Should I set a course?


In [9]:
# show a list of all the unique values within the 'who' column in order to see what needs to be cleaned. 
StarTrek_Cols['who'].unique().tolist()

[' PICARD V.O.',
 ' PICARD',
 ' DATA',
 ' TROI',
 ' WORF',
 ' CONN',
 ' "Q" (ELIZABETHAN)',
 ' MEDIC',
 ' "Q" (MARINE CAPTAIN)',
 ' PICARD "Q" (MARINE CAPTAIN) PICARD "Q" (MARINE CAPTAIN)',
 ' "Q" (21ST CENTURY)',
 ' TASHA',
 ' PICARD TROI',
 " PICARD'S INTERCOM VOICE",
 ' MANDARIN-BAILIFF',
 ' FUTURE MILITARY OFFICER',
 ' MILITARY OFFICER',
 ' "Q" (JUDGE)',
 ' "Q" (JUDGE) MANDARIN-BAILIFF',
 ' DATA "Q" (JUDGE)',
 ' OPS',
 " RIKER'S VOICE",
 ' ZORN',
 ' RIKER',
 " ZORN ZORN (Cont'd)",
 ' WESLEY',
 ' BEVERLY',
 ' MARKHAM',
 ' GEORDI/MARKHAM',
 ' GEORDI',
 ' BANDI WOMAN',
 " DATA'S VOICE",
 ' RIKER (V.O.)',
 ' ADMIRAL',
 " PICARD'S VOICE",
 ' PICARD WORF',
 " TROI'S VOICE",
 ' PICARD ZORN',
 ' YOUNG ENSIGN',
 ' COMPUTER VOICE',
 ' RIKER COMPUTER VOICE',
 ' RIKER DATA',
 " WESLEY'S VOICE",
 ' PICARD WES',
 ' BEVERLY WESLEY',
 " TASHA'S VOICE",
 ' WES',
 ' SECURITY VOICE',
 " ZORN'S VOICE",
 ' SECURITY POSITION',
 ' OPERATIONS POSITION',
 " ZORN'S VOICE PICARD",
 ' INTERCOM VOICE',
 ' "Q" 
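The labels above show several recurring patterns: leading whitespace, voice-over markers (`V.O.`, `'S VOICE`), parenthetical notes such as `(Cont'd)` or `(JUDGE)`, quoted names, and nicknames such as `WES`. A minimal sketch of one way to normalize them follows. The function name `normalize_speaker` and the `ALIASES` map are hypothetical, not part of the notebook, and multi-speaker labels such as `PICARD TROI` are deliberately left untouched because splitting them requires a decision about who the line belongs to.

```python
import re

# Hypothetical alias map; extend it after reviewing the full list of labels.
ALIASES = {"WES": "WESLEY"}


def normalize_speaker(raw):
    """Return a cleaned speaker label, or None for missing (non-string) values."""
    if not isinstance(raw, str):
        return None
    name = raw.strip().upper()
    # Drop parenthetical notes such as (V.O.), (Cont'd), (JUDGE)
    name = re.sub(r"\s*\([^)]*\)", "", name)
    # Drop a trailing voice-over marker: "PICARD V.O." -> "PICARD"
    name = re.sub(r"\s+V\.O\.$", "", name)
    # "RIKER'S VOICE" and "PICARD'S INTERCOM VOICE" -> "RIKER", "PICARD"
    name = re.sub(r"'S\s+(?:INTERCOM\s+)?VOICE$", "", name)
    # Remove the quotes around names such as "Q"
    name = name.replace('"', "").strip()
    return ALIASES.get(name, name)


# A few labels copied from the output above
sample = [" PICARD V.O.", " RIKER (V.O.)", " TROI'S VOICE", ' "Q" (JUDGE)', " WES"]
print([normalize_speaker(s) for s in sample])
# prints ['PICARD', 'RIKER', 'TROI', 'Q', 'WESLEY']
```

Applied to the notebook's data, this could be used as `StarTrek_Cols['who'].map(normalize_speaker)`. Labels without a possessive, such as `COMPUTER VOICE`, pass through unchanged.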

## To Do

- Decide which characters to keep (the main cast plus recurring characters)
- Start the NLP process 
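
One possible way to approach the first item is to count how many lines and how many distinct episodes each speaker has, then keep the speakers above a cutoff. The sketch below uses a small toy frame in place of `StarTrek_Cols` and assumes the `who` labels have already been cleaned. The thresholds `MIN_LINES` and `MIN_EPISODES` are assumptions chosen for illustration, not values from this analysis.

```python
import pandas as pd

# Toy stand-in for StarTrek_Cols (hypothetical rows, cleaned speaker labels)
toy = pd.DataFrame({
    "who": ["PICARD", "PICARD", "PICARD", "DATA", "DATA", "MEDIC"],
    "episode": ["encounter at farpoint", "encounter at farpoint", "haven",
                "haven", "conspiracy", "encounter at farpoint"],
})

MIN_LINES = 2      # assumed cutoff: minimum number of spoken lines
MIN_EPISODES = 2   # assumed cutoff: minimum number of distinct episodes

# Lines spoken per speaker, and distinct episodes each speaker appears in
line_counts = toy["who"].value_counts()
episode_counts = toy.groupby("who")["episode"].nunique()

# Keep speakers that clear both cutoffs
enough_lines = line_counts[line_counts >= MIN_LINES].index
enough_episodes = episode_counts[episode_counts >= MIN_EPISODES].index
keep = enough_lines.intersection(enough_episodes)

filtered = toy[toy["who"].isin(keep)]
print(sorted(keep))
# prints ['DATA', 'PICARD']
```

On the real data the cutoffs would need tuning so that the main cast and recurring characters survive while one-off speakers such as `MEDIC` are dropped.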