## Blended learning
Watch the following video, then read the information in this file and do the exercises.
Video link: https://youtu.be/LHBE6Q9XlzI?t=39852

In [None]:
import pandas as pd

## Series object
#### A Series can be indexed by types other than integers; its index can also be nonsequential integers or even floats.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                 index=['a','b','c','d'])
print('string index')
print(data)

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                 index=[25, 0, 100, 56])
print('nonsequential indices')
print(data)

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                 index=[2.5, 0.0, 10.0, 5.6])
print('float indices')
print(data)

#### We have several methods to explore the data.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                 index=['a','b','c','d'])
list(data.items())

In [None]:
data.keys()

In [None]:
data.values

In [None]:
# Try to figure out what the following code does before running it.
if 'e' not in data:
    data['e'] = 1.25
data

In [None]:
# Fancy indexing
data[['b','d', 'e']]

#### Implicit and explicit indexing

In [None]:
data = pd.Series({2:'a', 4:'b', 1:'c', 3:'d'})
data

In [None]:
# Explicit index
print(data[2], data.loc[2])

In [None]:
# Implicit index
print(data[2:3], '\n', data.iloc[2])
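
The difference is easiest to see when the same number is used both as a label and as a position. The following is a minimal sketch using the same Series as above; the variable names are only illustrative.

```python
import pandas as pd

# Same Series as above: integer labels that are not in positional order
data = pd.Series({2: 'a', 4: 'b', 1: 'c', 3: 'd'})

by_label = data.loc[2]       # explicit: the value whose label is 2
by_position = data.iloc[2]   # implicit: the value at position 2 (third item)

# The same slice gives different results:
# .loc slices by label and includes the end label,
# .iloc slices by position and excludes the end position.
label_slice = data.loc[2:4]
position_slice = data.iloc[2:4]

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

Using `.loc` and `.iloc` makes it clear which kind of index is meant, which matters most when the labels are integers.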

### Like a dict
The Series object is a bit like a dictionary: it maps typed keys to a set of typed values. (In a plain Python dictionary, both keys and values can be of arbitrary type.) A Series can be created from a Python dictionary.

In [None]:
population_dict = {'California': 38332521,
                   'Texas':      26448193,
                   'New York':   19651127,
                   'Florida':    19552860,
                   'Illinois':   12882135}
population = pd.Series(population_dict)
population

Values are retrieved by key, just as in a dictionary. To get the available index labels, use *.index*. Note that this returns the same type of object as *.keys()*.

In [None]:
print(population['California'])
print(population.index)
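
As a small sketch of the dictionary-like behaviour, the following compares *.index* with *.keys()* and shows that the `in` operator checks the index labels, not the values. It uses a shortened, two-state version of the Series above, and the variable names are only illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})

same_type = type(population.index) is type(population.keys())
same_labels = population.index.equals(population.keys())

has_texas = 'Texas' in population     # membership tests the index, like dict keys
has_value = 26448193 in population    # the values are not searched

print(same_type, same_labels, has_texas, has_value)
```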

The Series also supports array-style operations such as slicing. The following returns all populations from Texas to Florida; note that a label slice includes its end label.

In [None]:
population['Texas':'Florida']
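
To compare label slicing with positional slicing, here is a short sketch on the same population data. The names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

by_label = population['Texas':'Florida']   # label slice: 'Florida' is included
by_position = population.iloc[1:3]         # positional slice: stops before position 3

print(by_label.index.tolist())
print(by_position.index.tolist())
```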

## DataFrame
A DataFrame can be created from a single Series object or from several Series objects.

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states
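
The following sketch shows both cases on deliberately small Series: a single Series turned into a one-column DataFrame, and two Series whose indexes only partly overlap. The shortened data and the variable names are assumptions made for illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

# One Series becomes a one-column DataFrame
single = population.to_frame(name='population')

# Several Series are aligned on their index; a label missing from one
# Series is filled with NaN in that column
both = pd.DataFrame({'population': population, 'area': area})

print(single.shape)
print(both)
```

Because the rows are matched by label and not by position, the order of the entries in each Series does not matter.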

In [None]:
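# A column can be accessed as an attribute, but if the column name can be
# confused with a method, use the explicit dictionary-style access instead.
a_pop = states.population
ad_pop = states['population']
print(a_pop == ad_pop, '\n')
# Whether both accesses return the very same object (`is`) can depend on
# the pandas version; the values are equal either way.
print('Column attribute is the same object as dict-style access: ', a_pop is ad_pop)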
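
To illustrate the warning in the comment above, here is a minimal sketch with a hypothetical frame whose column is named `count`, which is also the name of a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: 'count' is a column name and also a DataFrame method
df = pd.DataFrame({'count': [3, 5], 'area': [1.0, 2.0]})

attr = df.count      # the bound method DataFrame.count, not the column
col = df['count']    # the column, as a Series

print(callable(attr))
print(col.tolist())
```

Dictionary-style access always returns the column, so it is the safer choice whenever a column name might clash with an existing attribute or method.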

#### Now that we have several columns, we can print both the row and the column information. Just as in a dict, we can access the values in a column via its name.

In [None]:
print(states.index)
print(states.columns, '\n')
print(states['area'])

In [None]:
# A new column can be added in the same way as a new key in a dictionary.
states['density'] = states['population'] / states['area']
states

#### How would you obtain the area and density for the first three rows?
Do you prefer loc or iloc?

In [None]:
# implicit indexing of values gives us the first row
states.values[0]

In [None]:
# index location
states.iloc[:3, 1:3]

In [None]:
# Note that loc slices include the given end labels ('New York' is included)
states.loc[:'New York', 'area':]

#### Display the population and density for places where density is over 50 and under 100.

#### Any list of dictionaries can be made into a DataFrame, and any missing keys will be filled by Pandas as NaN. 

In [None]:
data = [{'a': i, 'b':2*i} for i in range(3)]
pd.DataFrame(data)

In [None]:
data.append({'a':8, 'c':9})
pd.DataFrame(data)
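
One way to inspect what happened is to compare *.dtypes* before and after appending the extra record, together with the number of missing values per column. This is only a sketch; the names `before`, `after` and `after_frame` are illustrative.

```python
import pandas as pd

data = [{'a': i, 'b': 2 * i} for i in range(3)]
before = pd.DataFrame(data).dtypes

# Append a record that lacks 'b' and introduces a new key 'c'
data.append({'a': 8, 'c': 9})
after_frame = pd.DataFrame(data)
after = after_frame.dtypes

print(before)
print(after)
print(after_frame.isna().sum())   # number of missing values per column
```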

#### Why did the columns change type?

#### Add another row to *states* with the state of Alaska. Alaska has a population of 1723337.

#### Change Alaska's area and density to be 0.

#### Create an array of the same shape and size as states, with all values set to 0.20.
Multiply states by this array, then add the resulting array to states (save the result in states).

#### Increase all values by 10% as this data is now outdated.

#### Look up *transpose*. What does it do? Try it on states.

### Exercises

1. Create a 20x20 array with only zeros. Set the first and last column to be float, the rest int.
2. Change the first and last column to be ones.
3. Create a DataFrame from this array, index the rows by letters of the alphabet, and name the columns by numbers.

## Answer the following questions using the *movies* dataset
Use the reference page if needed: https://pandas.pydata.org/docs/reference/frame.html

In [None]:
movie_data = '/data/movie.csv'
movies = pd.read_csv(movie_data)
movies.head()

#### Does this dataset contain any duplicate rows? If so, drop the duplicates so that only one copy of each row remains.

#### What data types are present in the dataset?   

#### Retrieve the "director_name" column by using 
1. the string as indexing operator
2. the attribute
3. loc 
4. iloc


#### What does the method .value_counts() do?

#### How many rows in the director_name column have NaN values?

#### Remove all rows with NaN values in the director_name column.

# Statistics and data
I mentioned some basic statistics last week when I talked about normalisation and standardisation, and I showed some graphs. The following video shows some more graph options and discusses some more statistical ideas that you need to understand for machine learning. Stop the video when the time reaches **5:13:50**.
Video link: https://youtu.be/ua-CiDNNj30?t=17042

Answer the following questions.  
1. We often use statistics when performing an exploratory data analysis (EDA). Why?
2. What different types of graphs are there?
3. Why would you use graphics to explore data?
4. What kind of distributions are there? 
5. What does the distribution tell us about the data?