# Python for Data Analysis
1. Setting Up Necessary Things
2. Python Libraries for Data Science
    1. NumPy Library
    2. SciPy Library
    3. Pandas Library
    4. Scikit-learn Library
    5. Matplotlib Library
    6. Seaborn Library
3. Python Libraries Practical
    1. Loading Python Libraries
    2. Reading Data Using Pandas
    3. Exploring DataFrame
4. DataFrame
    1. DataFrame Data Types
    2. DataFrame Attributes
    3. DataFrame Methods
    4. Selecting a Column in a DataFrame
        1. Description of a column
    5. DataFrame groupby Method
    6. DataFrame: Filtering
    7. DataFrame: Slicing
        1. Selecting Rows
        2. Method loc
        3. Method iloc
    8. DataFrame: Sorting
    9. Missing Values
5. Aggregation Functions in Pandas
    1. Basic Descriptive Statistics
6. Graphics to Explore the Data
    1. Graphics
7. Basic Statistical Analysis

# 1. Setting Up Necessary Things

In [1]:
# Ignore All Warnings
import warnings
warnings.filterwarnings("ignore")

# 2. Python Libraries for Data Science

Popular Python libraries for data science:
1. [NumPy](https://numpy.org/)
2. [SciPy](https://scipy.org/)
3. [Pandas](https://pandas.pydata.org/)
4. [Scikit-learn](https://scikit-learn.org/)

`Visualization` libraries:
1. [Matplotlib](https://matplotlib.org/)
2. [Seaborn](https://seaborn.pydata.org/)

*and many more...*

**All these libraries are installed on the [SCC](https://www.alibabacloud.com/product/scc).**

## 2.1. NumPy Library
Link: [NumPy](https://numpy.org/)

* Introduces objects for `multidimensional arrays and matrices`, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects.
* Provides `vectorization of mathematical operations` on arrays and matrices, which significantly improves performance.
* Many other Python libraries are built on NumPy.
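
As a minimal sketch of what vectorization means in practice, the snippet below applies the same arithmetic to every element of an array in one expression and compares it with an explicit loop. The salary values are hypothetical and used only for illustration.

```python
import numpy as np

# Hypothetical salaries, used only for illustration
salaries = np.array([90000.0, 65000.0, 150000.0, 60000.0])

# Vectorized: one expression applies a 5% raise to every element
raised = salaries * 1.05

# Equivalent explicit Python loop, shown only for comparison
raised_loop = np.array([s * 1.05 for s in salaries])

# Reductions such as the mean are vectorized as well
mean_salary = salaries.mean()
```

Both approaches give the same numbers; the vectorized form is shorter and runs in compiled code rather than in the Python interpreter.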

## 2.2. SciPy Library
Link: [SciPy](https://scipy.org/)
* Collection of algorithms for `linear algebra`, `differential equations`, `numerical integration`, `optimization`, `statistics` and more.
* Part of the SciPy Stack.
* Built on `NumPy`.

## 2.3. Pandas Library
Link: [Pandas](https://pandas.pydata.org/)
* Adds **data structures** (`Series` and `DataFrame`) and **tools** designed to work with `tabular data` (similar to data frames in [R](https://www.r-project.org/)).
* Provides tools for **data manipulation**: `Reshaping`, `Merging`, `Sorting`, `Slicing`, `Aggregation` etc.
* Allows handling `missing data`.

## 2.4. Scikit-learn Library
Link: [Scikit-learn](https://scikit-learn.org/)
* Provides **machine learning** algorithms: `Classification`, `Regression`, `Clustering`, `Model validation` etc.
* Built on `NumPy`, `SciPy` and `Matplotlib`.

## 2.5. Matplotlib Library
Link: [Matplotlib](https://matplotlib.org/)
* Python `2D plotting library` which produces publication quality figures in a variety of hardcopy formats.
* A set of functionalities similar to those of [MATLAB](https://www.mathworks.com/products/matlab.html).
* `Line Plots`, `Scatter Plots`, `Bar Charts`, `Histograms`, `Pie Charts` etc.
* Relatively low-level; some effort is needed to create advanced visualizations.

## 2.6. Seaborn Library
Link: [Seaborn](https://seaborn.pydata.org/)
* Based on `Matplotlib`.
* Provides a high-level interface for drawing attractive statistical graphics.
* Similar (in style) to the popular [ggplot2](https://ggplot2.tidyverse.org/) library in [R](https://www.r-project.org/).


# 3. Python Libraries Practical

## 3.1. Loading Python Libraries

In [2]:
# Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 3.2. Reading Data Using Pandas

In [3]:
# Read CSV File
df = pd.read_csv("../../data/salary_data.csv")

*Note: The above command has many optional arguments to fine-tune the data import process...*
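
A few of those optional arguments can be sketched as follows. To keep the example self-contained it reads from an in-memory string rather than a file; the CSV text, the `"missing"` marker and the column selection are assumptions made for illustration, not part of the salary data set.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Age,Gender,Salary
32,Male,90000
28,Female,missing
45,Male,150000
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Age", "Salary"],   # read only these columns
    na_values=["missing"],       # treat this marker as a missing value
    dtype={"Age": "int64"},      # force the type of a column
)
```

Here `Salary` becomes `float64` because it now contains a `NaN`, while `Age` keeps the requested integer type.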

There are a number of `Pandas` commands to read other data formats:
```python
pd.read_excel("<my-file>.xlsx", sheet_name = "<sheet-name>", index_col = None, na_values = ["NA"])

pd.read_stata("<my-file>.dta")

pd.read_sas("<my-file>.sas7bdat")

pd.read_hdf("<my-file>.h5", "df")
```

## 3.3. Exploring DataFrame

In [4]:
# First 5 Records
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0
4,52.0,Male,Master's,Director,20.0,200000.0


In [5]:
# First 10 Records
df.head(10)

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0
4,52.0,Male,Master's,Director,20.0,200000.0
5,29.0,Male,Bachelor's,Marketing Analyst,2.0,55000.0
6,42.0,Female,Master's,Product Manager,12.0,120000.0
7,31.0,Male,Bachelor's,Sales Manager,4.0,80000.0
8,26.0,Female,Bachelor's,Marketing Coordinator,1.0,45000.0
9,38.0,Male,PhD,Senior Scientist,10.0,110000.0


In [6]:
# Last 5 Records
df.tail()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
370,35.0,Female,Bachelor's,Senior Marketing Analyst,8.0,85000.0
371,43.0,Male,Master's,Director of Operations,19.0,170000.0
372,29.0,Female,Bachelor's,Junior Project Manager,2.0,40000.0
373,34.0,Male,Bachelor's,Senior Operations Coordinator,7.0,90000.0
374,44.0,Female,PhD,Senior Business Analyst,15.0,150000.0


In [7]:
# Last 10 Records
df.tail(10)

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
365,43.0,Male,Master's,Director of Marketing,18.0,170000.0
366,31.0,Female,Bachelor's,Junior Financial Analyst,3.0,50000.0
367,41.0,Male,Bachelor's,Senior Product Manager,14.0,150000.0
368,44.0,Female,PhD,Senior Data Engineer,16.0,160000.0
369,33.0,Male,Bachelor's,Junior Business Analyst,4.0,60000.0
370,35.0,Female,Bachelor's,Senior Marketing Analyst,8.0,85000.0
371,43.0,Male,Master's,Director of Operations,19.0,170000.0
372,29.0,Female,Bachelor's,Junior Project Manager,2.0,40000.0
373,34.0,Male,Bachelor's,Senior Operations Coordinator,7.0,90000.0
374,44.0,Female,PhD,Senior Business Analyst,15.0,150000.0


# 4. DataFrame

## 4.1. DataFrame Data Types

| Pandas Type                   | Native Python Type                                               | Description                                                                                                                                        |
|-------------------------------|------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| `object`                      | `str`                                                            | The most general dtype. Assigned to a column that holds strings or mixed types (numbers and strings).                                              |
| `int64`                       | `int`                                                            | Integer numbers. 64 refers to the number of bits used to store each value.                                                                         |
| `float64`                     | `float`                                                          | Floating-point numbers. If a column contains numbers and NaNs, pandas defaults to float64, because NaN is itself a floating-point value.           |
| `datetime64[ns]`, `timedelta64[ns]` | `N/A` (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments.                                                                 |
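
The effect of missing values and mixed content on the inferred type can be checked with a short sketch. The three series below are hypothetical examples, not columns of the salary data.

```python
import numpy as np
import pandas as pd

# Whole numbers only: stored as int64
whole = pd.Series([1, 2, 3])

# The same numbers with one missing value: promoted to float64
with_gap = pd.Series([1, 2, np.nan])

# Numbers mixed with a string: falls back to object
mixed = pd.Series([1, "two", 3.0])

dtypes = [str(whole.dtype), str(with_gap.dtype), str(mixed.dtype)]
```

This is why `Age` and `Years of Experience` show up as `float64` in the data set used here: both columns contain missing values.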

In [8]:
# Check a particular column type
df["Salary"].dtype

dtype('float64')

In [9]:
# Check types for all the columns
df.dtypes

Age                    float64
Gender                  object
Education Level         object
Job Title               object
Years of Experience    float64
Salary                 float64
dtype: object

## 4.2. DataFrame Attributes
Python objects have *attributes*:

| DataFrame Attribute | Description                                       |
|---------------------|---------------------------------------------------|
| `dtypes`            | List the *types of the columns*.                  |
| `columns`           | List the *column names*.                          |
| `axes`              | List the *row labels* and *column names*.         |
| `ndim`              | Number of *dimensions*.                           |
| `size`              | Number of *elements*.                             |
| `shape`             | Return a tuple representing the *dimensionality*. |
| `values`            | NumPy representation of the *data*.               |

In [10]:
# Total number of records
df.shape[0]

375

In [11]:
# Total number of elements
df.size

2250

In [12]:
# Column names
df.columns

Index(['Age', 'Gender', 'Education Level', 'Job Title', 'Years of Experience',
       'Salary'],
      dtype='object')

In [13]:
# All columns data types
df.dtypes

Age                    float64
Gender                  object
Education Level         object
Job Title               object
Years of Experience    float64
Salary                 float64
dtype: object

## 4.3. DataFrame Methods

| DataFrame Method         | Description                                                   |
|--------------------------|---------------------------------------------------------------|
| `head([n])`, `tail([n])` | *first*/*last* n rows.                                        |
| `describe()`             | Generate descriptive *statistics* (for numeric columns only). |
| `max()`, `min()`         | Return *max*/*min* values for all numeric columns.            |
| `mean()`, `median()`     | Return *mean*/*median* values for all numeric columns.        |
| `std()`                  | *Standard Deviation*.                                         |
| `sample([n])`            | Returns a *random sample* of the DataFrame.                   |
| `dropna()`               | *Drop* all the records with *missing values*.                 |

*All attributes and methods can be listed with the `dir()` function...*
```python
dir(df)
```

In [14]:
# Summary Description of numeric columns
df.describe()

Unnamed: 0,Age,Years of Experience,Salary
count,373.0,373.0,373.0
mean,37.431635,10.030831,100577.345845
std,7.069073,6.557007,48240.013482
min,23.0,0.0,350.0
25%,31.0,4.0,55000.0
50%,36.0,9.0,95000.0
75%,44.0,15.0,140000.0
max,53.0,25.0,250000.0


In [15]:
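# Standard Deviation (numeric columns only; recent pandas versions raise an error on text columns otherwise)
df.std(numeric_only = True)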

Age                        7.069073
Years of Experience        6.557007
Salary                 48240.013482
dtype: float64

In [16]:
# Mean values of the numeric columns for the first 50 records
df[:50].mean(numeric_only = True)

Age                       36.28
Years of Experience        9.02
Salary                 98100.00
dtype: float64

## 4.4. Selecting a Column in a DataFrame

In [17]:
# Method 1: Subset the DataFrame using column name
df["Gender"]

0        Male
1      Female
2        Male
3      Female
4        Male
        ...  
370    Female
371      Male
372    Female
373      Male
374    Female
Name: Gender, Length: 375, dtype: object

In [18]:
# Method 2: Use the column name as an attribute
df.Gender

0        Male
1      Female
2        Male
3      Female
4        Male
        ...  
370    Female
371      Male
372    Female
373      Male
374    Female
Name: Gender, Length: 375, dtype: object

### 4.4.1. Description of a column

In [19]:
# Basic Statistics of the Salary column
df["Salary"].describe()

count       373.000000
mean     100577.345845
std       48240.013482
min         350.000000
25%       55000.000000
50%       95000.000000
75%      140000.000000
max      250000.000000
Name: Salary, dtype: float64

In [20]:
# Values count of the Salary column
df["Salary"].count()

373

In [21]:
# Average salary from the Salary column
df["Salary"].mean()

100577.34584450402

## 4.5. DataFrame groupby Method

Using the `groupby` method we can:
* Split the data into groups based on some criteria.
* Calculate statistics or apply a function to each group.
* Work in a way similar to the `group_by()` function from the `dplyr` package in [R](https://www.r-project.org/).

In [22]:
# Group data using Education Level
df_education_level = df.groupby(["Education Level"])

In [23]:
# Calculate mean value for each numeric column per each group
df_education_level.mean(numeric_only = True)

Unnamed: 0_level_0,Age,Years of Experience,Salary
Education Level,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bachelor's,34.3125,6.966518,74756.026786
Master's,40.765306,13.397959,129795.918367
PhD,44.72549,17.019608,157843.137255


Once a `groupby` object is created, we can calculate various statistics for each group...

In [24]:
# Calculate mean Salary for each Education Level
df.groupby("Education Level")[["Salary"]].mean()

Unnamed: 0_level_0,Salary
Education Level,Unnamed: 1_level_1
Bachelor's,74756.026786
Master's,129795.918367
PhD,157843.137255



*Note: If `single brackets` are used to specify the column, the output is a Pandas `Series`. When `double brackets` are used, the output is a `DataFrame`.*

By default, the group keys are sorted during the `groupby` operation. You may want to pass `sort = False` for potential speedup.

In [25]:
# Calculate mean Salary for each Education Level without sorting the group keys
df.groupby(["Education Level"], sort = False)[["Salary"]].mean()

Unnamed: 0_level_0,Salary
Education Level,Unnamed: 1_level_1
Bachelor's,74756.026786
Master's,129795.918367
PhD,157843.137255
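
In the salary data the education levels happen to first appear in alphabetical order, so both calls print the same table. The sketch below uses a small hypothetical DataFrame in which `PhD` comes first, to make the effect of `sort = False` visible. It also uses single brackets, so each result is a `Series`.

```python
import pandas as pd

# Hypothetical records in which PhD appears before the other levels
toy = pd.DataFrame({
    "Education Level": ["PhD", "Bachelor's", "PhD", "Master's"],
    "Salary": [150000.0, 60000.0, 170000.0, 120000.0],
})

# Default: group keys are sorted
sorted_means = toy.groupby("Education Level")["Salary"].mean()

# sort=False: groups stay in order of first appearance
unsorted_means = toy.groupby("Education Level", sort=False)["Salary"].mean()

sorted_keys = list(sorted_means.index)
unsorted_keys = list(unsorted_means.index)
```

The group means are identical in both cases; only the order of the rows in the result differs.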


## 4.6. DataFrame: Filtering
To subset the data, we can apply Boolean indexing. This indexing is commonly known as a filter.

**Example:** *Subset the rows in which the salary value is greater than $120K.*

In [26]:
# Salary greater than $120K
df_greater_salary = df[df["Salary"] > 120000]
df_greater_salary.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
4,52.0,Male,Master's,Director,20.0,200000.0
11,48.0,Female,Bachelor's,HR Manager,18.0,140000.0
13,40.0,Female,Master's,Project Manager,14.0,130000.0
15,44.0,Male,Bachelor's,Operations Manager,16.0,125000.0


*Any comparison operator can be used to subset the data...*

`> greater`

`>= greater or equal`

`< less`

`<= less or equal`

`== equal`

`!= not equal`
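
Several conditions can be combined into one filter. The sketch below assumes a small hypothetical DataFrame; note that pandas uses `&`, `|` and `~` (not `and`, `or`, `not`) and that each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical records, used only for illustration
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Salary": [90000.0, 65000.0, 130000.0, 150000.0],
})

# Both conditions must hold (AND)
high_paid_women = toy[(toy["Gender"] == "Female") & (toy["Salary"] > 120000)]

# At least one condition must hold (OR)
female_or_high = toy[(toy["Gender"] == "Female") | (toy["Salary"] > 120000)]

# Negate a condition (NOT)
not_female = toy[~(toy["Gender"] == "Female")]
```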

In [27]:
# Select only the rows where Gender is Female
df_female = df[df["Gender"] == "Female"]
df_female.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0
6,42.0,Female,Master's,Product Manager,12.0,120000.0
8,26.0,Female,Bachelor's,Marketing Coordinator,1.0,45000.0
11,48.0,Female,Bachelor's,HR Manager,18.0,140000.0


## 4.7. DataFrame: Slicing
There are a number of ways to subset the DataFrame:
* One or more columns
* One or more rows
* A subset of rows and columns

*Rows and columns can be selected by their position or label ...*

*When selecting one column, it is possible to use a single set of brackets, but the resulting object will be a Series, not a DataFrame.*

In [28]:
# Select a single column with a single set of brackets
df["Salary"]

0       90000.0
1       65000.0
2      150000.0
3       60000.0
4      200000.0
         ...   
370     85000.0
371    170000.0
372     40000.0
373     90000.0
374    150000.0
Name: Salary, Length: 375, dtype: float64

In [29]:
# Select a single column with a double set of brackets
df[["Salary"]]

Unnamed: 0,Salary
0,90000.0
1,65000.0
2,150000.0
3,60000.0
4,200000.0
...,...
370,85000.0
371,170000.0
372,40000.0
373,90000.0
374,150000.0


*Let's check the type of the object returned with a single set of brackets and with a double set of brackets...*

In [30]:
type(df["Salary"])

pandas.core.series.Series

In [31]:
type(df[["Salary"]])

pandas.core.frame.DataFrame

*When we need to select more than one column and/or want the output to be a DataFrame, we should use double brackets...*

In [32]:
# Select columns Education Level and Salary
df[["Education Level", "Salary"]]

Unnamed: 0,Education Level,Salary
0,Bachelor's,90000.0
1,Master's,65000.0
2,PhD,150000.0
3,Bachelor's,60000.0
4,Master's,200000.0
...,...,...
370,Bachelor's,85000.0
371,Master's,170000.0
372,Bachelor's,40000.0
373,Bachelor's,90000.0
374,PhD,150000.0


### 4.7.1. Selecting Rows
If we need to select a range of rows, we can specify the range using `:`

In [33]:
# Select rows by their position
df[10:20]

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
10,29.0,Male,Master's,Software Developer,3.0,75000.0
11,48.0,Female,Bachelor's,HR Manager,18.0,140000.0
12,35.0,Male,Bachelor's,Financial Analyst,6.0,65000.0
13,40.0,Female,Master's,Project Manager,14.0,130000.0
14,27.0,Male,Bachelor's,Customer Service Rep,2.0,40000.0
15,44.0,Male,Bachelor's,Operations Manager,16.0,125000.0
16,33.0,Female,Master's,Marketing Manager,7.0,90000.0
17,39.0,Male,PhD,Senior Engineer,12.0,115000.0
18,25.0,Female,Bachelor's,Data Entry Clerk,0.0,35000.0
19,51.0,Male,Bachelor's,Sales Director,22.0,180000.0


*Notice that the first row has position 0, and the last value in the range is omitted. So the range 10:20 returns the 10 rows at positions 10 through 19, just as 0:10 would return the first 10 rows (positions 0 through 9).*

### 4.7.2. Method loc
If we need to select a range of rows by their labels, we can use `loc`. Unlike positional slicing, a `loc` slice includes its end label, so `10:20` returns the rows labeled 10 through 20:

In [34]:
# Select rows by their labels (10 through 20) and the columns Education Level, Gender and Salary
df.loc[10:20, ["Education Level", "Gender", "Salary"]]

Unnamed: 0,Education Level,Gender,Salary
10,Master's,Male,75000.0
11,Bachelor's,Female,140000.0
12,Bachelor's,Male,65000.0
13,Master's,Female,130000.0
14,Bachelor's,Male,40000.0
15,Bachelor's,Male,125000.0
16,Master's,Female,90000.0
17,PhD,Male,115000.0
18,Bachelor's,Female,35000.0
19,Bachelor's,Male,180000.0
20,Master's,Female,80000.0


### 4.7.3. Method iloc
If we need to select a range of rows and/or columns by their positions, we can use `iloc`:

In [35]:
# Select rows and columns by their positions
df.iloc[10:20, [0, 2, 4]]

Unnamed: 0,Age,Education Level,Years of Experience
10,29.0,Master's,3.0
11,48.0,Bachelor's,18.0
12,35.0,Bachelor's,6.0
13,40.0,Master's,14.0
14,27.0,Bachelor's,2.0
15,44.0,Bachelor's,16.0
16,33.0,Master's,7.0
17,39.0,PhD,12.0
18,25.0,Bachelor's,0.0
19,51.0,Bachelor's,22.0
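
The difference between label-based and position-based slicing is easiest to see when the row labels do not start at 0. The sketch below assumes a hypothetical DataFrame whose index runs from 10 to 14.

```python
import pandas as pd

# Hypothetical data with row labels 10..14 instead of the default 0..4
toy = pd.DataFrame(
    {"Salary": [90000.0, 65000.0, 150000.0, 60000.0, 200000.0]},
    index=[10, 11, 12, 13, 14],
)

# loc slices by label and includes the end label: rows 10, 11 and 12
by_label = toy.loc[10:12]

# iloc slices by position and excludes the end position: positions 0 and 1
by_position = toy.iloc[0:2]

label_rows = list(by_label.index)
position_rows = list(by_position.index)
```

With the default integer index, labels and positions coincide, which is why the two indexers are easy to confuse.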


*more...*

In [36]:
# First row of the DataFrame
df.iloc[0]

Age                                 32.0
Gender                              Male
Education Level               Bachelor's
Job Title              Software Engineer
Years of Experience                  5.0
Salary                           90000.0
Name: 0, dtype: object

In [37]:
# (i + 1)th row of the DataFrame
i = 3

df.iloc[i]

Age                               36.0
Gender                          Female
Education Level             Bachelor's
Job Title              Sales Associate
Years of Experience                7.0
Salary                         60000.0
Name: 3, dtype: object

In [38]:
# Last row of the DataFrame
df.iloc[-1]

Age                                       44.0
Gender                                  Female
Education Level                            PhD
Job Title              Senior Business Analyst
Years of Experience                       15.0
Salary                                150000.0
Name: 374, dtype: object

In [39]:
# First column of the DataFrame
df.iloc[:, 0]

0      32.0
1      28.0
2      45.0
3      36.0
4      52.0
       ... 
370    35.0
371    43.0
372    29.0
373    34.0
374    44.0
Name: Age, Length: 375, dtype: float64

In [40]:
# Last column of the DataFrame
df.iloc[:, -1]

0       90000.0
1       65000.0
2      150000.0
3       60000.0
4      200000.0
         ...   
370     85000.0
371    170000.0
372     40000.0
373     90000.0
374    150000.0
Name: Salary, Length: 375, dtype: float64

In [41]:
# First 7 rows of the DataFrame
df.iloc[0:7]

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0
4,52.0,Male,Master's,Director,20.0,200000.0
5,29.0,Male,Bachelor's,Marketing Analyst,2.0,55000.0
6,42.0,Female,Master's,Product Manager,12.0,120000.0


In [42]:
# First 2 columns of the DataFrame
df.iloc[:, 0:2]

Unnamed: 0,Age,Gender
0,32.0,Male
1,28.0,Female
2,45.0,Male
3,36.0,Female
4,52.0,Male
...,...,...
370,35.0,Female
371,43.0,Male
372,29.0,Female
373,34.0,Male
374,44.0,Female


In [43]:
# 2nd through 3rd rows and first 2 columns of the DataFrame
df.iloc[1:3, 0:2]

Unnamed: 0,Age,Gender
1,28.0,Female
2,45.0,Male


In [44]:
# 1st and 6th rows and 2nd and 4th columns of the DataFrame
df.iloc[[0, 5], [1, 3]]

Unnamed: 0,Gender,Job Title
0,Male,Software Engineer
5,Male,Marketing Analyst


## 4.8. DataFrame: Sorting
We can sort the data by the values in a column. By default, the sorting is in ascending order and a new DataFrame is returned.

In [45]:
# Create a new DataFrame from the original sorted by the column Salary
df_sorted = df.sort_values(by = "Salary")
df_sorted.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
259,29.0,Male,Bachelor's,Junior Business Operations Analyst,1.5,350.0
82,25.0,Male,Bachelor's,Sales Representative,0.0,30000.0
97,26.0,Male,Bachelor's,Junior Software Developer,1.0,35000.0
218,29.0,Male,Bachelor's,Junior Business Operations Analyst,1.5,35000.0
49,25.0,Male,Bachelor's,Help Desk Analyst,0.0,35000.0


*We can sort the data using 2 or more columns...*

In [46]:
# Sort by Job Title (ascending), then by Salary (descending) within each title
df_sorted = df.sort_values(by = ["Job Title", "Salary"], ascending = [True, False])
df_sorted.head(10)

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
46,32.0,Male,Bachelor's,Account Manager,5.0,75000.0
31,31.0,Female,Bachelor's,Accountant,4.0,55000.0
135,39.0,Female,Bachelor's,Administrative Assistant,10.0,55000.0
43,36.0,Female,Bachelor's,Administrative Assistant,8.0,45000.0
20,34.0,Female,Master's,Business Analyst,5.0,80000.0
94,33.0,Male,Bachelor's,Business Analyst,7.0,75000.0
68,34.0,Male,Master's,Business Development Manager,8.0,90000.0
51,33.0,Male,Master's,Business Intelligence Analyst,7.0,85000.0
30,50.0,Male,Bachelor's,CEO,25.0,250000.0
105,44.0,Male,PhD,Chief Data Officer,16.0,220000.0
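
As the output above shows, sorted rows keep their original labels. The sketch below, on a small hypothetical DataFrame, confirms that `sort_values` leaves the original untouched and shows one way to renumber the rows afterwards with `reset_index`.

```python
import pandas as pd

# Hypothetical records, used only for illustration
toy = pd.DataFrame({
    "Job Title": ["Analyst", "Manager", "Analyst"],
    "Salary": [60000.0, 120000.0, 75000.0],
})

# Job Title ascending, Salary descending within each title
ordered = toy.sort_values(by=["Job Title", "Salary"], ascending=[True, False])

# Rows keep their original labels after sorting
ordered_labels = list(ordered.index)

# The original DataFrame is not modified
original_labels = list(toy.index)

# Renumber the rows 0..n-1 and discard the old labels
renumbered = ordered.reset_index(drop=True)
```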


## 4.9. Missing Values
Missing values are marked as `NaN`.

In [47]:
# Select the rows that have at least one missing value
df[df.isnull().any(axis=1)].head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
172,,,,,,
260,,,,,,


*There are a number of methods to deal with missing values in the DataFrame...*

| DataFrame Method                | Description                                            |
|---------------------------------|--------------------------------------------------------|
| `dropna()`                      | Drop *missing observations*.                           |
| `dropna(how = "all")`           | Drop observations where all cells are `NA`.            |
| `dropna(axis = 1, how = "all")` | Drop columns in which all values are missing.          |
| `dropna(thresh = 5)`            | Drop rows that contain fewer than 5 non-missing values. |
| `fillna(0)`                     | Replace missing values with zeros.                     |
| `isnull()`                      | Returns True if the value is missing.                  |
| `notnull()`                     | Returns True for non-missing values.                   |
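
The methods in the table can be tried on a small example. The DataFrame below is hypothetical and has only two columns, so `thresh = 2` is used in place of `thresh = 5`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 2 are partly missing,
# row 3 is entirely missing
toy = pd.DataFrame({
    "Age": [32.0, np.nan, 45.0, np.nan],
    "Salary": [90000.0, 65000.0, np.nan, np.nan],
})

# Drop rows with any missing value: only row 0 remains
complete_rows = toy.dropna()

# Drop rows where every value is missing: row 3 is removed
not_all_missing = toy.dropna(how="all")

# Keep rows with at least 2 non-missing values: only row 0 remains
at_least_two = toy.dropna(thresh=2)

# Replace missing values with zeros
filled = toy.fillna(0)

# Count missing values per column
missing_per_column = toy.isnull().sum()
```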

* When `summing up the data`, *missing values* will be treated as `zero`.
* If *all values are missing*, the sum is `0` by default (pandas 0.22 and later); pass `min_count = 1` to get `NaN` instead.
* [cumsum()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumsum.html) and [cumprod()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumprod.html) methods ignore missing values but preserve them in the resulting arrays.
* The `groupby` method excludes missing values (just like in [R](https://www.r-project.org/)).
* Many descriptive statistics methods have a `skipna` option to control whether missing data should be excluded. This value is set to `True` by default (unlike [R](https://www.r-project.org/)).
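
These rules can be checked on a short hypothetical `Series`. The sketch reflects the behavior of pandas 0.22 and later, where the sum of an all-missing `Series` is `0` unless `min_count` is given.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series([1.0, np.nan, 3.0])

# Missing values are skipped by default
total = s.sum()

# skipna=False propagates the missing value instead
strict_total = s.sum(skipna=False)

# cumsum skips the missing value but keeps it in the result
running = s.cumsum()

# All values missing: the default sum is 0.0
all_missing = pd.Series([np.nan, np.nan])
empty_total = all_missing.sum()

# Require at least one valid value to get NaN instead
empty_total_strict = all_missing.sum(min_count=1)
```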

# 5. Aggregation Functions in Pandas
The [agg()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) method is useful when multiple statistics are computed per column.

In [48]:
df[["Age", "Years of Experience"]].agg(["min", "mean", "max"])

Unnamed: 0,Age,Years of Experience
min,23.0,0.0
mean,37.431635,10.030831
max,53.0,25.0
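
`agg` also accepts a dictionary, so that each column gets its own set of statistics. The sketch below assumes a small hypothetical DataFrame; cells for statistics that were not requested for a column come back as `NaN`.

```python
import pandas as pd

# Hypothetical records, used only for illustration
toy = pd.DataFrame({
    "Age": [32.0, 28.0, 45.0],
    "Salary": [90000.0, 65000.0, 150000.0],
})

# Different statistics for different columns
summary = toy.agg({"Age": ["min", "max"], "Salary": ["mean"]})

youngest = summary.loc["min", "Age"]
oldest = summary.loc["max", "Age"]
mean_salary = summary.loc["mean", "Salary"]
```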


## 5.1. Basic Descriptive Statistics

| DataFrame Method         | Description                                                           |
|--------------------------|-----------------------------------------------------------------------|
| `describe`               | Basic statistics (*count*, *mean*, *std*, *min*, *quantiles*, *max*). |
| `min`, `max`             | *Minimum* and *maximum* values.                                       |
| `mean`, `median`, `mode` | Arithmetic *average*, *median* and *mode*.                            |
| `var`, `std`             | *Variance* and *standard deviation*.                                  |
| `sem`                    | *Standard error of mean*.                                             |
| `skew`                   | *Sample skewness*.                                                    |
| `kurt`                   | *Kurtosis*.                                                           |

# 6. Graphics to Explore the Data
The [Seaborn](https://seaborn.pydata.org/) package is built on [Matplotlib](https://matplotlib.org/) but provides a high-level interface for drawing attractive statistical graphics, similar to the `ggplot2` library in [R](https://www.r-project.org/). It specifically targets statistical `data visualization`.

In [49]:
# To show graphs within a Jupyter notebook, include the inline directive
%matplotlib inline

## 6.1. Graphics

| Method       | Description                                                           |
|--------------|-----------------------------------------------------------------------|
| `histplot`   | *Histogram* (replaces `distplot`, deprecated since seaborn 0.11).     |
| `barplot`    | Estimate of *central tendency* for a numeric variable.                |
| `violinplot` | Similar to *boxplot*, also shows the probability density of the data. |
| `jointplot`  | *Scatterplot*.                                                        |
| `regplot`    | *Regression plot*.                                                    |
| `pairplot`   | *Pairplot*.                                                           |
| `boxplot`    | *Boxplot*.                                                            |
| `swarmplot`  | *Categorical scatterplot*.                                            |
| `catplot`    | *General categorical plot* (named `factorplot` before seaborn 0.9).   |

# 7. Basic Statistical Analysis
[Statsmodels](https://www.statsmodels.org/stable/index.html) and [Scikit-learn](https://scikit-learn.org/) both have a number of functions for statistical analysis.

The former is mostly used for classical statistical analysis with [R](https://www.r-project.org/)-style formulas, while [Scikit-learn](https://scikit-learn.org/) is more tailored to `Machine Learning`.

[Statsmodels](https://www.statsmodels.org/stable/index.html):
* [Linear Regression](https://www.statsmodels.org/stable/regression.html)
* [ANOVA](https://www.statsmodels.org/stable/anova.html)
* Hypothesis Testing
* *many more...*

[Scikit-learn](https://scikit-learn.org/):
* [K-Means Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
* [Support Vector Machines](https://scikit-learn.org/stable/modules/svm.html)
* [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* *many more...*