# Creating selections and subsets of your data

### There are many ways to get selections or subsets of your data:
#### - selecting a column with `df['averageRating']`
#### - selecting multiple columns using a list: `df[['tconst', 'averageRating']]`
#### - selecting a subset using a condition: `df[df['averageRating'] > 9.0]`
#### - filtering with a query string: `df.query("averageRating > 9.0")`

### Let's first read in our data again

In [3]:
import pandas as pd
pd.options.display.max_columns = 50

df = pd.read_csv('./data/most_voted_titles_enriched.csv')

In [4]:
df.head(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre1,genre2,genre3,url,averageRating,numVotes,metascore,country,primary_language,color,budget,opening_weekend_usa,gross_usa,cumulative_worldwide,tagline,summary,image_url
0,tt0010323,movie,The Cabinet of Dr. Caligari,Das Cabinet des Dr. Caligari,0,1920,,76.0,"Fantasy,Horror,Mystery",Fantasy,Horror,Mystery,https://www.imdb.com/title/tt0010323,8.1,57097,,Germany,,Black and White,"$18,000",,"$8,811","$8,811",You must become Caligari.,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",https://m.media-amazon.com/images/M/MV5BNWJiNG...
1,tt0012349,movie,The Kid,The Kid,0,1921,,68.0,"Comedy,Drama,Family",Comedy,Drama,Family,https://www.imdb.com/title/tt0012349,8.3,112377,,USA,,Black and White,"$250,000",,,"$26,916",This is the great film he has been working on ...,"The Tramp cares for an abandoned child, but ev...",https://m.media-amazon.com/images/M/MV5BZjhhMT...
2,tt0013442,movie,Nosferatu,"Nosferatu, eine Symphonie des Grauens",0,1922,,94.0,"Fantasy,Horror",Fantasy,Horror,,https://www.imdb.com/title/tt0013442,7.9,88440,,Germany,,Black and White,,,,"$19,054",A thrilling mystery masterpiece - a chilling p...,Vampire Count Orlok expresses interest in a ne...,https://m.media-amazon.com/images/M/MV5BMTAxYj...


## Let's say we only want 1 column. How do we do that? Here are 2 ways:

### 1. Specifying the column you want: let's say we want to only look at the startYear column

In [None]:
df['startYear']

### Specifying only 1 column gives you a Series

In [None]:
type(df['startYear'])
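To make the Series versus DataFrame difference concrete, here is a minimal sketch on a small toy frame (the name `toy` is just for illustration; the values are copied from the output above). It assumes nothing beyond pandas itself: single brackets with one name give a Series, while a one-element list keeps the result a DataFrame.

```python
import pandas as pd

# Toy frame: a tiny stand-in for df, values taken from the head() output above.
toy = pd.DataFrame({
    "primaryTitle": ["The Kid", "Nosferatu", "The Godfather"],
    "startYear": [1921, 1922, 1972],
})

as_series = toy["startYear"]    # one column name  -> Series (1-dimensional)
as_frame = toy[["startYear"]]   # list of one name -> DataFrame (2-dimensional)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The one-column DataFrame is handy when a later step expects a table rather than a single column.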

### 2. The column names are also attributes, so you can also use dot notation

In [None]:
df.startYear
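Dot notation is convenient, but it only works for column names that are valid Python identifiers and that do not clash with an existing DataFrame attribute. The sketch below illustrates this on a toy frame; the columns `count` and `start year` are hypothetical and were added only to show the two pitfalls.

```python
import pandas as pd

# Toy frame; "count" and "start year" are hypothetical column names
# chosen to demonstrate where dot notation breaks down.
toy = pd.DataFrame({
    "startYear": [1921, 1922, 1972],
    "count": [1, 2, 3],
    "start year": [1921, 1922, 1972],
})

# For a simple name both notations return the same Series.
same = toy.startYear.equals(toy["startYear"])

# "count" is also a DataFrame method, so the attribute wins over the column.
dot_is_method = callable(toy.count)
count_column = toy["count"]

# A name with a space can only be reached with brackets.
spaced_column = toy["start year"]

print(same, dot_is_method, count_column.tolist(), spaced_column.tolist())
```

Bracket notation always works, so it is the safer default in code you intend to keep.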

### Selecting multiple columns can be done by passing a list of column names

In [None]:
columns_needed = ['tconst', 'averageRating', 'startYear']

df[columns_needed]

### Let's say you only want titles with an average rating greater than 9.0. We first build a boolean vector (one True/False per row) and then use it to filter:

In [None]:
df['averageRating'] > 9.0

In [None]:
df[df['averageRating'] > 9.0].head(3)
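A minimal sketch of what happens in these two cells, on a toy frame with ratings copied from this notebook's outputs. The variable names (`toy`, `mask`) are only illustrative. It also shows `.loc`, which lets you filter rows and pick columns in one step.

```python
import pandas as pd

# Toy frame with ratings taken from the outputs in this notebook.
toy = pd.DataFrame({
    "primaryTitle": ["The Kid", "Nosferatu", "The Godfather", "The Godfather: Part II"],
    "averageRating": [8.3, 7.9, 9.2, 9.0],
})

# The comparison is evaluated per row and yields a boolean Series.
mask = toy["averageRating"] > 9.0

# True counts as 1, so summing the mask gives the number of matching rows.
n_matches = int(mask.sum())

# .loc takes the row mask and a column label together.
high_titles = toy.loc[mask, "primaryTitle"]

print(mask.tolist())
print(n_matches)
print(high_titles.tolist())
```

Note that 9.0 itself does not pass `> 9.0`; use `>=` if you want to include it.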

### But what if we want multiple conditions, e.g. average rating greater than 9 AND only movies? Combine the boolean vectors with `&`, wrapping each condition in parentheses:

In [None]:
(df['titleType'] == 'movie')

In [None]:
(df['averageRating'] > 9.0)

In [None]:
df[(df['titleType'] == 'movie') & (df['averageRating'] > 9.0)]
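Besides `&` (and), boolean vectors can be combined with `|` (or) and inverted with `~` (not). The sketch below shows all three on a toy frame; the last row ("Hypothetical Series") is made up for illustration, since all titles shown so far are movies.

```python
import pandas as pd

# Toy frame; the final row is hypothetical, added to have a non-movie title.
toy = pd.DataFrame({
    "primaryTitle": ["The Kid", "Nosferatu", "The Godfather",
                     "The Godfather: Part II", "Hypothetical Series"],
    "titleType": ["movie", "movie", "movie", "movie", "tvSeries"],
    "averageRating": [8.3, 7.9, 9.2, 9.0, 9.5],
})

# Naming the conditions keeps the filter readable.
is_movie = toy["titleType"] == "movie"
is_high = toy["averageRating"] > 9.0

both = toy[is_movie & is_high]     # movie AND rating > 9.0
either = toy[is_movie | is_high]   # movie OR rating > 9.0
not_movie = toy[~is_movie]         # everything that is NOT a movie

print(both["primaryTitle"].tolist())
print(len(either))
print(not_movie["primaryTitle"].tolist())
```

The parentheses in the cell above matter because `&` binds more tightly than `==` and `>`. Python's own `and` / `or` do not work on a Series either: they raise a `ValueError` about the ambiguous truth value.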

### But this gets tedious, so I prefer to use the DataFrame method `.query()`

In [None]:
df.query("titleType == 'movie' and averageRating > 9")
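Inside a query string you can refer to Python variables by prefixing them with `@`, which avoids pasting values into the string. A minimal sketch, using a toy frame and the hypothetical variable names `min_rating` and `wanted_type`:

```python
import pandas as pd

# Toy frame; the final row is hypothetical, added to have a non-movie title.
toy = pd.DataFrame({
    "primaryTitle": ["The Kid", "The Godfather", "Hypothetical Series"],
    "titleType": ["movie", "movie", "tvSeries"],
    "averageRating": [8.3, 9.2, 9.5],
})

# Hypothetical parameters, defined outside the query string.
min_rating = 9.0
wanted_type = "movie"

# @name looks up the Python variable with that name.
result = toy.query("titleType == @wanted_type and averageRating > @min_rating")

# The equivalent boolean-vector version, for comparison.
same = toy[(toy["titleType"] == wanted_type) & (toy["averageRating"] > min_rating)]

print(result["primaryTitle"].tolist())
print(result.equals(same))
```

Both forms select the same rows; which one reads better is largely a matter of taste.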

### One handy way of selecting rows whose value is in a list of strings is `.isin()`

In [None]:
df[df['genre1'].isin(['Crime', 'Drama'])]
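`.isin()` also returns a boolean vector, so it can be inverted with `~` to exclude values. The sketch below uses a toy frame with genres copied from the outputs in this notebook; the missing `genre3` values mirror the empty cells there and show that a missing value never counts as a match.

```python
import pandas as pd

# Toy frame; genre values are taken from the outputs in this notebook.
toy = pd.DataFrame({
    "primaryTitle": ["The Kid", "Nosferatu", "The Godfather"],
    "genre1": ["Comedy", "Fantasy", "Crime"],
    "genre3": ["Family", None, None],
})

wanted = ["Crime", "Drama"]

in_wanted = toy[toy["genre1"].isin(wanted)]        # keep Crime or Drama
not_in_wanted = toy[~toy["genre1"].isin(wanted)]   # keep everything else

# Missing values are treated as "not in the list".
family_mask = toy["genre3"].isin(["Family"])

print(in_wanted["primaryTitle"].tolist())
print(not_in_wanted["primaryTitle"].tolist())
print(family_mask.tolist())
```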

### Ok, ok, just one more thing: if you want to find a string in a text, you can use `.str.contains('your_text', case=False)`

In [6]:
df[df['originalTitle'].str.contains('godfather', case=False)]

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre1,genre2,genre3,url,averageRating,numVotes,metascore,country,primary_language,color,budget,opening_weekend_usa,gross_usa,cumulative_worldwide,tagline,summary,image_url
316,tt0068646,movie,The Godfather,The Godfather,0,1972,,175.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0068646,9.2,1608367,100.0,USA,English,Color,"$6,000,000","$302,393,","$134,966,411","$246,120,986",An offer you can't refuse.,The aging patriarch of an organized crime dyna...,https://m.media-amazon.com/images/M/MV5BM2MyNj...
354,tt0071562,movie,The Godfather: Part II,The Godfather: Part II,0,1974,,202.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0071562,9.0,1122882,90.0,USA,English,Color,"$13,000,000","$171,417,","$47,834,595","$48,035,783",All the power on earth can't change destiny.,The early life and career of Vito Corleone in ...,https://m.media-amazon.com/images/M/MV5BMWMwMG...
912,tt0099674,movie,The Godfather: Part III,The Godfather: Part III,0,1990,,162.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0099674,7.6,357561,60.0,USA,English,Color,"$54,000,000","$6,387,271,","$66,761,392","$136,766,062",Real power can't be given. It must be taken.,"Follows Michael Corleone, now in his 60s, as h...",https://m.media-amazon.com/images/M/MV5BNWFlYW...
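Two further options of `.str.contains()` are worth knowing: `na=False` treats missing values as non-matches (without it, a missing title produces a missing value in the mask, which cannot be used for filtering), and `regex=False` searches for the literal text instead of a regular expression. A minimal sketch on a toy Series; the titles come from this notebook, and the `None` entry is a hypothetical missing value.

```python
import pandas as pd

# Toy Series of titles; the None entry is a hypothetical missing value.
titles = pd.Series([
    "The Godfather",
    "The Godfather: Part II",
    "Das Cabinet des Dr. Caligari",
    None,
])

# Case-insensitive search; missing titles count as "no match".
godfather_mask = titles.str.contains("godfather", case=False, na=False)

# By default the pattern is a regular expression, where "." matches any character.
dot_as_regex = titles.str.contains(".", na=False)

# With regex=False the dot is searched for literally.
dot_as_text = titles.str.contains(".", regex=False, na=False)

print(godfather_mask.tolist())
print(dot_as_regex.tolist())
print(dot_as_text.tolist())
```

If your search text contains characters such as `.`, `(`, or `+`, `regex=False` is usually what you want.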
