# Neoscholar Machine Learning Tutorials
### Session 01. Introduction to Numpy, Pandas, Matplotlib

### Contents
1. Numpy
2. Pandas
3. Matplotlib
4. EDA (Exploratory Data Analysis)

### Aim
At the end of this session, you will be able to:
- Understand the basics of numpy.
- Understand the basics of pandas.
- Understand the basics of matplotlib.
- Perform an Exploratory Data Analysis (EDA).


## 2. Pandas
Pandas is another essential open-source Python library, widely used today by data scientists and ML engineers. It was created by Wes McKinney and is built on top of numpy. The name 'Pandas' is derived from the term "Panel Data", an econometrics term for datasets that include observations of the same subjects over multiple time periods.

### 2.1 Basics of Pandas

In [None]:
# Run this cell if you haven't installed the pandas library yet
! pip install pandas

In [None]:
import pandas as pd
import numpy as np

In [None]:
print(pd.__version__)

The main data structures of pandas are **Series** and **DataFrame**, where data are stored and manipulated. A `Series` can simply be understood as a column, and a `DataFrame` as a table made up of many Series.

In [None]:
a = pd.Series([1, 2, 3, np.nan, 5, 6])

In [None]:
a

**Let's see how they are different!**

In [None]:
# Creating a Series using pandas.Series()
# This is just one of many ways to initialise a pandas series
module_score_dic = {'Database': 90, 'Security': 70, 'Math': 100, 'Machine Learning': 80}
module_score = pd.Series(module_score_dic)
print("Module_score: \n", module_score, '\n')
print("type: ", type(module_score), '\n')

# Creating a DataFrame using pandas DataFrame()
dataframe = pd.DataFrame(module_score, columns=['score'])
# dataframe = pd.DataFrame(module_score, index=[x for x in module_score.keys()], columns=['score'])
print("dataframe: \n", dataframe, '\n')
print("type: ", type(dataframe))

A Series can also be seen as a DataFrame that has only one attribute (column).  
**Now let's make a Dataframe that has multiple attributes**

In [None]:
solar_data = {
    'Name' : ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"],
    'Satellite' : [0, 0, 1, 2, 79, 60, 27, 14],
    'AU' : [0.4, 0.7, 1, 1.5, 5.2, 9.5, 19.2, 30.1],
    'Diameter (in 1Kkm)' : [4.9, 12.1, 12.7, 6.8, 139.8, 116.5, 50.7, 49.2]
}

In [None]:
solar_system = pd.DataFrame(solar_data, index = [i for i in range(1, 9)])
solar_system

In [None]:
solar_system.dtypes # check data type

We can inspect and select parts of a DataFrame using the attributes and methods that `pd.DataFrame` provides.

- `head()` : Returns the first n rows
- `tail()` : Returns the last n rows
- `index` : Returns the index (row labels)
- `columns` : Returns the column labels
- `loc` : Returns the row(s) with the given index label
- `values` : Returns only the values, without index and column names
- `describe()` : Outputs a statistical summary of the DataFrame
- `sort_values(by, axis=0, ascending=True, inplace=False)` : Sorts the DataFrame by the given column(s)
- `drop()` : Drops the selected rows (or columns)
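
One detail about the methods listed above is easy to miss: with the default `inplace=False`, `sort_values()` and `drop()` return a new DataFrame and leave the original untouched. The following is a minimal sketch using a small hypothetical DataFrame called `toy` (it is not part of this session's data):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"name": ["a", "b", "c"], "value": [3, 1, 2]}, index=[1, 2, 3])

# sort_values() returns a sorted copy; `toy` keeps its original order
sorted_toy = toy.sort_values(by="value", ascending=False)

# drop() also returns a copy; the row labelled 2 is still present in `toy`
dropped_toy = toy.drop(index=2)

print(sorted_toy["value"].tolist())  # [3, 2, 1]
print(toy["value"].tolist())         # [3, 1, 2]
print(dropped_toy.index.tolist())    # [1, 3]
print(toy.index.tolist())            # [1, 2, 3]
```

To keep the result, assign it to a variable (for example `toy = toy.drop(index=2)`) or pass `inplace=True`.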

In [None]:
df = solar_system

In [None]:
df.head() # the default value in the bracket is 5

In [None]:
df.tail(2)

In [None]:
df.index

In [None]:
df.index[0]

In [None]:
df.columns

In [None]:
# TODO: df.loc[0] gives you an error, but df.iloc[0] works fine. Can you see why?
# Try playing around with different numbers and find out why before you google it!
# Google search keyword: df.loc vs df.iloc

df.loc[1]

In [None]:
df.iloc[0]

In [None]:
df.values

In [None]:
df.describe()

In [None]:
df.sort_values(by = 'Diameter (in 1Kkm)', ascending = False)  # the column name must match exactly

In [None]:
# To Do: Re-sort the DataFrame by the number of satellites, in descending order.
df.sort_values(None)

Before 2006, Pluto was classified as a planet of the solar system. Let's bring him back to our solar system, by adding Pluto to our DataFrame.

In [None]:
df.loc[9] = ["Pluto", 0, 39.5, 2.38]
df

Let's reclassify Pluto as a dwarf planet again.

In [None]:
# To drop Pluto, use either df.drop(index=label) or df.drop(df.index[position])
df.drop(None)

### 2.2 Read Data with Pandas
Pandas supports reading and writing data from/to various file formats, including CSV, JSON and SQL databases, converting what it reads into a DataFrame.
1. `pd.read_csv()` : Read CSV files
2. `pd.read_json()` : Read JSON files
3. `pd.read_sql_query()` : Read the result of a SQL query
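
To see what the `index_col` argument of `pd.read_csv()` does without needing a file on disk, the sketch below reads a tiny in-memory CSV through `io.StringIO`. The CSV text and its column names are made up for illustration and are not taken from the movie dataset.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file
csv_text = "Title,Year,Rating\nAlpha,2014,8.1\nBeta,2016,7.3\n"

# Default: rows get an integer index 0, 1, ...
by_position = pd.read_csv(io.StringIO(csv_text))

# index_col: the 'Title' column becomes the row labels
by_title = pd.read_csv(io.StringIO(csv_text), index_col="Title")

print(by_position.index.tolist())      # [0, 1]
print(by_title.index.tolist())         # ['Alpha', 'Beta']
print(by_title.loc["Beta", "Rating"])  # 7.3
```

With a label index in place, rows can be looked up by name using `loc`, while `iloc` still works by position.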

In [None]:
# TODO: Try each option, and see what the difference is.
# Option 1
movie = pd.read_csv("./data/IMDB-Movie-Data.csv", index_col = "Title")

#Option 2
# movie = pd.read_csv("./data/IMDB-Movie-Data.csv")
print(type(movie))
movie

In [None]:
# To Do: Extract third row
movie.iloc[2]
# movie.loc["Split"]
# movie.loc[2]    --> Is this going to work? if not, why not?

In [None]:
# To Do: Sort the table by Ratings, in descending order
# Do you agree with the rankings? :)
movie.None

In [None]:
# To Do: Sort the table by 'Revenue(Millions)', in ascending order and print the first 3 rows out
movie.None

In [None]:
# The value_counts() function is used to get a Series containing counts of unique values
movie['Genre'].value_counts().head()

In [None]:
# This is called a "Masking Operation"
# Keep only the movies with a runtime of at least 170 minutes, then sort the result by rating in descending order.
movie[movie['Runtime (Minutes)'] >= 170].sort_values(by="Rating", ascending=False)
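
Masks can also be combined. The sketch below uses a small made-up DataFrame (`films` and its values are hypothetical) to show that each condition is a boolean Series, and that `&` (and) and `|` (or) are used in place of Python's `and` / `or`:

```python
import pandas as pd

# Hypothetical data, used only for illustration
films = pd.DataFrame(
    {"Year": [2008, 2012, 2016], "Rating": [9.0, 6.5, 8.2]},
    index=["A", "B", "C"],
)

# A comparison yields a boolean Series: one True/False per row
recent = films["Year"] >= 2010
good = films["Rating"] > 8.0

both = films[recent & good]    # rows where both conditions hold
either = films[recent | good]  # rows where at least one condition holds

print(both.index.tolist())    # ['C']
print(either.index.tolist())  # ['A', 'B', 'C']
```

When the conditions are written inline, each one needs its own parentheses, for example `films[(films["Year"] >= 2010) & (films["Rating"] > 8.0)]`.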

In [None]:
# To Do: Using a masking operation, extract the movies whose 'Metascore' is greater than 95, and sort the result from the most recent to the least recent
None

In [None]:
# To Do: Extract the movies directed by a UCL alumnus ---> Hint: Tenet, Inception
None

#### 2.2.1 Pandas Exercise

To Do: Extract the list of movies that meet all of the requirements below:
1. Released in 2010 or later (key = 'Year')
2. Runtime is shorter than 150 minutes (key = 'Runtime (Minutes)')
3. Rating is above 8.0 (key = 'Rating')  

Print out only the first 3 movies from the result.

In [None]:
None

### 2.3 How to deal with Missing Data
To represent missing data, pandas uses `np.nan`. Data scientists and machine learning engineers sometimes simply remove missing data, but whether that is appropriate depends heavily on which values are missing, how many there are, and so on. Alternatively, you can fill the gaps with 0, with the mean of the column, or with the mean of only the 10 nearest values in the column. It is up to you to choose how to deal with missing data.
- `isnull()`: returns True or False for each cell, depending on whether it is null.
- `sum()`: a handy trick for counting True values. Applied to the result of `isnull()`, it gives the number of missing values in each column.
- `dropna()`: deletes every row that contains at least one null value.
- `fillna(value)`: fills missing values with the given value.
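
The strategies above can be compared on a small example. This is only a sketch: the `scores` DataFrame and its values are made up, and filling with the column mean is just one of the options mentioned, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up scores with two missing entries
scores = pd.DataFrame({"score": [10.0, np.nan, 30.0, np.nan, 50.0]})

# isnull() marks missing cells; sum() counts the True values
n_missing = scores["score"].isnull().sum()

# Three ways of handling the gaps
filled_zero = scores["score"].fillna(0)                       # replace NaN with 0
filled_mean = scores["score"].fillna(scores["score"].mean())  # mean() skips NaN
dropped = scores.dropna()                                     # keep complete rows only

print(n_missing)             # 2
print(filled_zero.tolist())  # [10.0, 0.0, 30.0, 0.0, 50.0]
print(filled_mean.tolist())  # [10.0, 30.0, 30.0, 30.0, 50.0]
print(len(dropped))          # 3
```

None of these calls modifies `scores` itself, since `inplace` defaults to `False`.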

In [None]:
movie.isnull()

In [None]:
movie.isnull().sum()

In [None]:
movie.shape

In [None]:
# Take a look at "Take Me Home Tonight" and "Search Party"
movie.fillna(value = 0)

In [None]:
movie.dropna(inplace = True)

In [None]:
movie.shape

After dropping the rows that contain missing data, the shape of the DataFrame has changed from (1000, 11) to (838, 11).

### 2.4 Merging Data
Those of you who know SQL may have noticed that pandas feels quite similar to a query language.
What is one of the most common operations in a relational database query language?  
Yes! (terminology alert!) Inner JOIN, Outer JOIN, Left JOIN, Right JOIN, Full JOIN...
- `concat()` : Concatenation. Used to combine two or more pandas objects.
- `merge()` : Behaves very similarly to a SQL JOIN.
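
The JOIN types mentioned above map onto the `how` argument of `pd.merge()`. A minimal sketch, assuming two made-up DataFrames (`left` and `right` are hypothetical names) that share only one key:

```python
import pandas as pd

# Hypothetical tables sharing the key column 'Modules'
left = pd.DataFrame({"Modules": ["Security", "Compilers"], "x": [64, 81]})
right = pd.DataFrame({"Modules": ["Compilers", "Robotics"], "y": [95, 78]})

inner = pd.merge(left, right, on="Modules", how="inner")     # keys found in both
outer = pd.merge(left, right, on="Modules", how="outer")     # keys found in either
left_join = pd.merge(left, right, on="Modules", how="left")  # every key from `left`

print(inner["Modules"].tolist())          # ['Compilers']
print(sorted(outer["Modules"].tolist()))  # ['Compilers', 'Robotics', 'Security']
print(left_join["Modules"].tolist())      # ['Security', 'Compilers']
```

Where a key has no match on one side, the columns from that side are filled with `NaN`. The default is `how="inner"`.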

We are going to create two random DataFrames, named df1 and df2.

In [None]:
df1 = pd.DataFrame(np.random.randn(10, 2))
df1

In [None]:
df2 = pd.DataFrame(np.random.randn(10, 3))
df2

In [None]:
pd.concat([df1, df2])

In [None]:
pd.concat([df1, df2], axis = 1)     # axis setting is very common in pandas
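
Because random numbers make the result hard to read, here is a smaller sketch of the same two calls on made-up DataFrames (`top` and `bottom` are hypothetical names). It shows how `concat()` keeps the original index labels and fills non-matching positions with `NaN`.

```python
import pandas as pd

# Hypothetical frames with different shapes
top = pd.DataFrame({"a": [1, 2]})
bottom = pd.DataFrame({"a": [3], "b": [4]})

stacked = pd.concat([top, bottom])                        # rows appended, labels kept
renumbered = pd.concat([top, bottom], ignore_index=True)  # fresh index 0, 1, 2
side = pd.concat([top, bottom], axis=1)                   # columns placed side by side

print(stacked.index.tolist())       # [0, 1, 0]
print(renumbered.index.tolist())    # [0, 1, 2]
print(stacked["b"].isnull().sum())  # 2
print(side.shape)                   # (2, 3)
```

With `axis=0` (the default) rows are stacked and columns are aligned by name; with `axis=1` columns are placed side by side and rows are aligned by index label.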

In [None]:
demis = pd.DataFrame(
    {'Modules': ['Bioinformatics', 'Robotic Systems', 'Security', 'Compilers'], 'Demis' : [75, 97, 64, 81]}
)
demis

In [None]:
sedol = pd.DataFrame(
    {'Modules': ['Bioinformatics', 'Robotic Systems', 'Security', 'Compilers'], 'Sedol' : [63, 78, 84, 95]})
sedol

In [None]:
pd.merge(demis, sedol, on = 'Modules')

In [None]:
# To Do at home: Define your own DataFrames and use the functions introduced above to concatenate or merge them.

In [None]:
# Your own trial code here!

### What to do next?
The websites below should be helpful for further study of the pandas library:
- [Pandas official website](https://pandas.pydata.org)
- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
- [Data Wrangling with Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)