# Python for Data Analysis

# 1. Setting Up Necessary Things

In [1]:
# Ignore All Warnings
import warnings
warnings.filterwarnings("ignore")
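
Ignoring every warning can also hide useful messages. As a hedged alternative, the sketch below shows two narrower options from the standard `warnings` module: filtering a single category, and silencing warnings only inside one block. The choice of `DeprecationWarning` and the warning text are illustrative assumptions, not part of the original setup.

```python
import warnings

# Ignore only DeprecationWarning; other categories are still reported.
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Alternatively, silence warnings only inside a specific block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    warnings.warn("hidden inside this block", UserWarning)
```

The filters set inside `catch_warnings()` are restored when the block exits, so the rest of the notebook keeps its previous warning behavior.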

# 2. Python Libraries for Data Science

Many popular Python libraries:
1. [NumPy](https://numpy.org/)
2. [SciPy](https://scipy.org/)
3. [Pandas](https://pandas.pydata.org/)
4. [Scikit-learn](https://scikit-learn.org/)

`Visualization` libraries:
1. [Matplotlib](https://matplotlib.org/)
2. [Seaborn](https://seaborn.pydata.org/)

*and many more...*

**All these libraries are installed on the [SCC](https://www.alibabacloud.com/product/scc).**

## 2.1. NumPy Library
Link: [NumPy](https://numpy.org/)

* Introduces objects for `multidimensional arrays and matrices`, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects.
* Provides `vectorization of mathematical operations` on arrays and matrices which significantly improves the performance.
* Many other Python libraries are built on NumPy.
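
To make the idea of vectorization concrete, here is a minimal sketch that applies the same operation with a Python loop and with a single array expression. The salary values and the 5% raise are made-up numbers used only for illustration.

```python
import numpy as np

# Hypothetical salary values, used only for illustration.
salaries = np.array([90000.0, 65000.0, 150000.0, 60000.0])

# Loop version: one Python-level multiplication per element.
raised_loop = np.array([s * 1.05 for s in salaries])

# Vectorized version: one expression applied to the whole array.
raised_vec = salaries * 1.05

# Both approaches produce the same values.
same = np.allclose(raised_loop, raised_vec)

# Statistical functions also operate on the whole array at once.
mean_salary = salaries.mean()
```

On large arrays the vectorized form is usually much faster because the loop runs in compiled code rather than in the Python interpreter.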

## 2.2. SciPy Library
Link: [SciPy](https://scipy.org/)
* A collection of algorithms for `linear algebra`, `differential equations`, `numerical integration`, `optimization`, `statistics` and more.
* Part of the SciPy Stack.
* Built on `NumPy`.
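
As a small illustration of the numerical integration routines, the sketch below integrates sin(x) from 0 to π with `scipy.integrate.quad`. The integrand is chosen here only because its exact integral (2) is easy to check by hand.

```python
import numpy as np
from scipy import integrate

# quad returns the estimated integral and an estimate of the absolute error.
value, abs_error = integrate.quad(np.sin, 0.0, np.pi)

print(value, abs_error)
```

Note that the integrand passed to `quad` is a NumPy function, which reflects how SciPy builds on NumPy.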

## 2.3. Pandas Library
Link: [Pandas](https://pandas.pydata.org/)
* Adds **data structures** (`Series` and `DataFrame`) and **tools** designed to work with `tabular data` (similar to data frames in [R](https://www.r-project.org/)).
* Provides tools for **data manipulation**: `Reshaping`, `Merging`, `Sorting`, `Slicing`, `Aggregation` etc.
* Allows handling `missing data`.
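
The points above can be sketched on a tiny, hand-made table. The column names mirror the salary data used later in this notebook, but the values, the variable name `people`, and the choice to fill gaps with the column mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# A small hypothetical table with one missing salary.
people = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Salary": [90000.0, 65000.0, np.nan, 60000.0],
})

# Missing data: count the gaps, then fill them with the column mean.
missing_count = people["Salary"].isna().sum()
filled = people["Salary"].fillna(people["Salary"].mean())

# Aggregation: mean salary per group (missing values are skipped).
mean_by_gender = people.groupby("Gender")["Salary"].mean()

# Sorting: highest salary first; missing values are placed last by default.
sorted_people = people.sort_values("Salary", ascending=False)
```

Whether filling with the mean is appropriate depends on the analysis; dropping rows with `dropna()` is another common option.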

## 2.4. Scikit-learn Library
Link: [Scikit-learn](https://scikit-learn.org/)
* Provides **machine learning** algorithms: `Classification`, `Regression`, `Clustering`, `Model validation` etc.
* Built on `NumPy`, `SciPy` and `Matplotlib`.

## 2.5. Matplotlib Library
Link: [Matplotlib](https://matplotlib.org/)
* Python `2D plotting library` which produces publication quality figures in a variety of hardcopy formats.
* A set of functionalities similar to those of [MATLAB](https://www.mathworks.com/products/matlab.html).
* `Line Plots`, `Scatter Plots`, `Bar Charts`, `Histograms`, `Pie Charts` etc.
* Relatively low-level: some effort is needed to create advanced visualizations.

## 2.6. Seaborn Library
Link: [Seaborn](https://seaborn.pydata.org/)
* Based on `Matplotlib`.
* Provides a high-level interface for drawing attractive statistical graphics.
* Similar (in style) to the popular [ggplot2](https://ggplot2.tidyverse.org/) library in [R](https://www.r-project.org/).


# 3. Python Libraries Practical

## 3.1. Loading Python Libraries

In [2]:
# Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 3.2. Reading Data Using Pandas

In [3]:
# Read CSV File
df = pd.read_csv("../../data/salary_data.csv")

*Note: The above command has many optional arguments to fine-tune the data import process...*
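
A minimal sketch of a few of those optional arguments follows. It reads from an in-memory string rather than from a file so that it runs on its own; the CSV text and the name `df_demo` are invented for this example. Using `index_col=0` is one way to avoid an extra `Unnamed: 0` column when the first column of the file holds a saved row index, which is assumed here.

```python
import io
import pandas as pd

# Hypothetical CSV content whose first column is an unnamed row index.
csv_text = """,Age,Gender,Salary
0,32.0,Male,90000.0
1,28.0,Female,NA
2,45.0,Male,150000.0
"""

df_demo = pd.read_csv(
    io.StringIO(csv_text),  # any file path or file-like object works here
    index_col=0,            # use the first column as the row index
    na_values=["NA"],       # treat the string "NA" as a missing value
)

print(df_demo)
```

With a real file, the path string would simply replace the `io.StringIO(...)` argument.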

There are a number of `Pandas` commands to read other data formats:
```python
pd.read_excel("<my-file>.xlsx", sheet_name="<sheet-name>", index_col=None, na_values=["NA"])

pd.read_stata("<my-file>.dta")

pd.read_sas("<my-file>.sas7bdat")

pd.read_hdf("<my-file>.h5", "df")
```

## 3.3. Exploring DataFrame

In [4]:
# First 5 Records
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0
4,52.0,Male,Master's,Director,20.0,200000.0


In [5]:
# First 10 Records
df.head(10)

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0
4,52.0,Male,Master's,Director,20.0,200000.0
5,29.0,Male,Bachelor's,Marketing Analyst,2.0,55000.0
6,42.0,Female,Master's,Product Manager,12.0,120000.0
7,31.0,Male,Bachelor's,Sales Manager,4.0,80000.0
8,26.0,Female,Bachelor's,Marketing Coordinator,1.0,45000.0
9,38.0,Male,PhD,Senior Scientist,10.0,110000.0


In [8]:
# Last 5 Records
df.tail()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
370,35.0,Female,Bachelor's,Senior Marketing Analyst,8.0,85000.0
371,43.0,Male,Master's,Director of Operations,19.0,170000.0
372,29.0,Female,Bachelor's,Junior Project Manager,2.0,40000.0
373,34.0,Male,Bachelor's,Senior Operations Coordinator,7.0,90000.0
374,44.0,Female,PhD,Senior Business Analyst,15.0,150000.0


In [9]:
# Last 10 Records
df.tail(10)

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
365,43.0,Male,Master's,Director of Marketing,18.0,170000.0
366,31.0,Female,Bachelor's,Junior Financial Analyst,3.0,50000.0
367,41.0,Male,Bachelor's,Senior Product Manager,14.0,150000.0
368,44.0,Female,PhD,Senior Data Engineer,16.0,160000.0
369,33.0,Male,Bachelor's,Junior Business Analyst,4.0,60000.0
370,35.0,Female,Bachelor's,Senior Marketing Analyst,8.0,85000.0
371,43.0,Male,Master's,Director of Operations,19.0,170000.0
372,29.0,Female,Bachelor's,Junior Project Manager,2.0,40000.0
373,34.0,Male,Bachelor's,Senior Operations Coordinator,7.0,90000.0
374,44.0,Female,PhD,Senior Business Analyst,15.0,150000.0


# 4. DataFrame

## 4.1. DataFrame Data Types
| Pandas Type                         | Native Python Type                                               | Description                                                                                                                             |
|-------------------------------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| `object`                            | `str`                                                            | The most general dtype. Assigned to a column that holds strings or mixed types (numbers and strings).                                   |
| `int64`                             | `int`                                                            | Integer values. 64 refers to the number of bits used to store each value.                                                               |
| `float64`                           | `float`                                                          | Floating-point values. If a column contains numbers and NaNs, pandas defaults to float64 because NaN is itself a floating-point value.  |
| `datetime64[ns]`, `timedelta64[ns]` | `N/A` (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments.                                                            |
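
The dtypes in the table can be inspected with the `dtypes` attribute. The sketch below builds a small, made-up DataFrame (the names `types_demo` and `with_gap` are hypothetical) and also shows how a missing value turns an otherwise integer column into `float64`. The exact names printed for the text and date columns can vary between pandas versions.

```python
import numpy as np
import pandas as pd

# A small hypothetical table with one column per kind of data.
types_demo = pd.DataFrame({
    "count": [1, 2, 3],                      # whole numbers
    "score": [1.5, np.nan, 3.0],             # decimals with a missing value
    "label": ["a", "b", "c"],                # text
    "when": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

# One dtype per column.
print(types_demo.dtypes)

# Integers plus a missing value: pandas stores the column as floats.
with_gap = pd.Series([1, 2, None])
print(with_gap.dtype)
```

Checking `df.dtypes` right after loading a file is a quick way to spot columns that were read with an unexpected type.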