# Creating selections and subsets of your data

### There are many ways to get selections or subsets of your data:
#### - selecting a column with `df['averageRating']`
#### - selecting multiple columns using a list: `df[['tconst', 'averageRating']]`
#### - selecting a subset using a condition: `df[df['averageRating'] > 9.0]`
#### - using `df.query("averageRating > 9.0")`

### Let's first read in our data again

In [None]:
import pandas as pd

df = pd.read_csv('most_voted_titles.csv')

In [None]:
df.head(3)

## Let's say we only want 1 column. How do we do that? Here are 2 ways:

### 1. Specifying the column you want: let's say we want to only look at the startYear column

In [None]:
df['startYear']

### Specifying only 1 column gives you a Series

In [None]:
type(df['startYear'])

### 2. The column names are also attributes, so you can also use dot notation

In [None]:
df.startYear
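Dot notation is convenient, but it does not work for every column name. Here is a minimal sketch of where it breaks, using a made-up toy dataframe (the `count` and `start year` columns are hypothetical and not part of `most_voted_titles.csv`):

```python
import pandas as pd

# Made-up toy data for illustration only; not rows from most_voted_titles.csv.
toy = pd.DataFrame({
    'startYear': [1994, 2008, 1972],
    'count': [10, 20, 30],              # name collides with the DataFrame.count method
    'start year': [1994, 2008, 1972],   # name contains a space
})

# For a plain column name, dot and bracket notation give the same Series.
same = toy.startYear.equals(toy['startYear'])

# toy.count is the DataFrame method, not the column.
count_is_method = callable(toy.count)

# Bracket notation always reaches the column, whatever its name.
count_column = toy['count']
spaced_column = toy['start year']
```

So bracket notation is the safer choice when a column name contains spaces or matches an existing dataframe attribute.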

### Selecting multiple columns can be done by passing a list of column names

In [None]:
columns_needed = ['tconst', 'averageRating', 'startYear']

df[columns_needed]
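One detail worth knowing: a single column name gives you a Series, while a list gives you a DataFrame, even if the list holds only one name. A small sketch on made-up toy data (the `tconst` values are invented for illustration):

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002'],
    'averageRating': [9.2, 7.5],
    'startYear': [1994, 2008],
})

one_column = toy['startYear']            # a string gives a Series
one_column_frame = toy[['startYear']]    # a list with one name gives a DataFrame
two_columns = toy[['tconst', 'averageRating']]  # columns come back in list order
```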

### Let's say you only want titles with an average rating greater than 9.0. We need to use boolean vectors:

In [None]:
df['averageRating'] > 9.0

In [None]:
df[df['averageRating'] > 9.0].head(3)
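To see what the boolean vector is doing, it can help to store it in a variable first. The sketch below uses made-up toy data; the name `mask` is just an illustrative choice:

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002', 'tt0000003', 'tt0000004'],
    'averageRating': [9.3, 8.1, 9.0, 9.5],
})

# One True/False value per row; 9.0 itself is not greater than 9.0.
mask = toy['averageRating'] > 9.0

# True counts as 1, so the sum is the number of matching rows.
n_matches = int(mask.sum())

# Rows where the mask is True.
high = toy[mask]

# .loc takes the same mask and can pick columns at the same time.
high_ids = toy.loc[mask, 'tconst']
```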

### But what if we want multiple conditions: average rating greater than 9.0 AND only movies? Put each condition in parentheses and combine them with `&`:

In [None]:
(df['titleType'] == 'movie')

In [None]:
(df['averageRating'] > 9.0)

In [None]:
df[(df['titleType'] == 'movie') & (df['averageRating'] > 9.0)]
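Conditions are combined element-wise with `&` (and), `|` (or) and `~` (not). Python's own `and` and `or` raise a `ValueError` on a Series, and the parentheses are needed because `&` binds more tightly than `==` and `>`. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    'titleType': ['movie', 'tvSeries', 'movie', 'tvSeries'],
    'averageRating': [9.3, 9.4, 8.0, 7.0],
})

is_movie = toy['titleType'] == 'movie'
is_high = toy['averageRating'] > 9.0

both = toy[is_movie & is_high]     # movie AND rating above 9.0
either = toy[is_movie | is_high]   # movie OR rating above 9.0
not_movie = toy[~is_movie]         # everything that is NOT a movie
```

Naming the masks first, as done here, is one way to keep longer filters readable.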

### But this gets tedious, so I prefer to use the dataframe method `.query()`

In [None]:
df.query("titleType == 'movie' and averageRating > 9")
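Inside a query string you can refer to Python variables by prefixing them with `@`, which is handy when the threshold is not hard-coded. A sketch on made-up toy data (`min_rating` and `wanted_type` are hypothetical variable names):

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    'titleType': ['movie', 'tvSeries', 'movie', 'tvSeries'],
    'averageRating': [9.3, 9.4, 8.0, 7.0],
})

min_rating = 9.0
wanted_type = 'movie'

# @name looks up a Python variable from the surrounding scope.
result = toy.query("titleType == @wanted_type and averageRating > @min_rating")

# The same selection written with boolean masks.
same_as_mask = toy[(toy['titleType'] == wanted_type) & (toy['averageRating'] > min_rating)]
```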

### Another handy way of selecting rows by string values is `.isin()`

In [None]:
df[df['genre1'].isin(['Crime', 'Drama'])]
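`.isin()` returns a boolean vector just like the comparisons above, so it can be negated with `~` to exclude values instead. A small sketch on made-up toy data (the name `wanted` is illustrative):

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({'genre1': ['Crime', 'Drama', 'Comedy', 'Action']})

wanted = ['Crime', 'Drama']

# Rows whose genre1 is one of the wanted values.
in_wanted = toy[toy['genre1'].isin(wanted)]

# Rows whose genre1 is NOT one of the wanted values.
not_wanted = toy[~toy['genre1'].isin(wanted)]

# Equivalent to chaining == comparisons with |, but shorter for long lists.
with_or = toy[(toy['genre1'] == 'Crime') | (toy['genre1'] == 'Drama')]
```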

### This is a bit off topic, but sorting your dataframe is also important. This can be done with `.sort_values()`. Don't forget to inspect the additional arguments of this method by pressing `Shift + Tab` inside the parentheses.

### 1. Let's first sort the dataframe on originalTitle using the `by` argument

In [None]:
df.sort_values(by='originalTitle')

### 2. Now sort the whole dataframe on originalTitle in descending order, using `ascending=False`

In [None]:
df.sort_values(by='originalTitle', ascending=False)

### 3. Or we can sort on multiple columns. Now we need to use lists:

In [None]:
df.sort_values(by=['startYear', 'runtimeMinutes'], ascending=[False, True])
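Keep in mind that `.sort_values()` returns a new, sorted dataframe and leaves the original untouched, so assign the result if you want to keep it. The sketch below shows this on made-up toy data, together with `.reset_index(drop=True)` to renumber the rows afterwards:

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    'originalTitle': ['B', 'A', 'C', 'D'],
    'startYear': [2008, 1994, 2008, 1994],
    'runtimeMinutes': [152, 142, 120, 154],
})

# Newest year first; within the same year, shortest runtime first.
sorted_toy = toy.sort_values(by=['startYear', 'runtimeMinutes'],
                             ascending=[False, True])

# The original index labels travel with the rows.
sorted_labels = list(sorted_toy.index)

# Renumber the rows 0..n-1 and discard the old labels.
renumbered = sorted_toy.reset_index(drop=True)
```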