# Python Libraries for Data Science

There are many popular Python toolboxes/libraries:
* NumPy
* SciPy
* Pandas
* SciKit-Learn

Visualization libraries:
* Matplotlib
* Seaborn

## NumPy


* introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
* provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
* many other Python libraries are built on NumPy
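
As a minimal sketch of what vectorization means in practice, the example below applies one arithmetic expression to a whole array instead of looping over its elements. The temperature readings and variable names are made up for illustration only.

```python
import numpy as np

# Hypothetical sample data: five temperature readings in Celsius.
celsius = np.array([12.5, 18.0, 21.3, 9.8, 15.4])

# Vectorized arithmetic: the expression is applied to every element at once.
fahrenheit = celsius * 9 / 5 + 32

# The equivalent explicit Python loop, shown only for comparison.
fahrenheit_loop = [c * 9 / 5 + 32 for c in celsius]

# Statistical functions operate on the whole array.
mean_c = celsius.mean()
std_c = celsius.std()

# A 2-D array (matrix) and its matrix product with its own transpose.
m = np.array([[1, 2], [3, 4]])
product = m @ m.T
```

The vectorized form is shorter, and on large arrays it usually runs much faster because the loop happens inside NumPy's compiled code.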


## SciPy

* collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
* built on NumPy


## Pandas

* adds data structures and tools designed to work with table-like data (similar to data frames in R)
* provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
* allows handling missing data
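
The following sketch touches the capabilities listed above (missing data, merging, aggregation, sorting) in a few lines. The two small tables and their column names are hypothetical and exist only to make the example runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical data: sales records with one missing amount.
sales = pd.DataFrame({
    "store": ["A", "B", "A", "C"],
    "amount": [100.0, np.nan, 250.0, 80.0],
})
regions = pd.DataFrame({
    "store": ["A", "B", "C"],
    "region": ["North", "South", "North"],
})

# Handle missing data: replace NaN with 0.
sales["amount"] = sales["amount"].fillna(0)

# Merge: attach the region of each store.
merged = sales.merge(regions, on="store", how="left")

# Aggregate and sort: total amount per region, largest first.
totals = merged.groupby("region")["amount"].sum().sort_values(ascending=False)
```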

## SciKit-Learn
* provides machine learning algorithms: classification, regression, clustering, model validation etc.
* built on NumPy, SciPy and matplotlib

## matplotlib

* Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats
* a set of functionalities similar to those of MATLAB
* line plots, scatter plots, bar charts, histograms, pie charts etc.
* relatively low-level; some effort needed to create advanced visualizations




## Seaborn

* based on matplotlib
* provides a high-level interface for drawing attractive statistical graphics
* similar (in style) to the popular ggplot2 library in R


# Pandas

* Open-source
* High-performance
* Easy-to-use data structures
* Data analysis tools
* Works with tabular data, much like a spreadsheet in Excel


## What Pandas can do

* Modeling data
* Creating data frames
* Series
  * One-dimensional array
  * Similar to NumPy arrays
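
To illustrate how a Series resembles a NumPy array, here is a small sketch with made-up values. It shows that element-wise arithmetic behaves the same way on both, while the Series additionally carries labels.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only for illustration.
arr = np.array([1.0, 2.0, 3.0])
ser = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])

# Element-wise arithmetic works the same way on both.
arr_doubled = arr * 2
ser_doubled = ser * 2

# NumPy functions accept a Series and keep its index.
ser_sqrt = np.sqrt(ser)

# Unlike a plain array, a Series can be addressed by label.
y_value = ser_doubled["y"]
```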


## Pandas - Data Frame

* Data frame
  * A spreadsheet-like table
* Used to prepare data
  * For data manipulation

## Data Frame data types
![image-20230806153235679](./assets/image-20230806153235679.png)

## Data Frame attributes
Python objects have *attributes* and *methods*.
![image-20230806153331618](./assets/image-20230806153331618.png)
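
As a rough sketch of the difference, attributes are read without parentheses while methods are called with them. The data frame below is hypothetical, and the attributes shown (`dtypes`, `columns`, `shape`) are a few common ones that may not match the figure exactly.

```python
import pandas as pd

# Hypothetical data frame used only to illustrate attribute and method access.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [23, 35, 41],
    "height": [1.62, 1.80, 1.75],
})

# Attributes: accessed without parentheses.
column_types = df.dtypes      # data type of each column
column_names = df.columns     # column labels
n_rows, n_cols = df.shape     # (number of rows, number of columns)

# Methods: called with parentheses, optionally with arguments.
first_two = df.head(2)        # first two rows
mean_age = df["age"].mean()   # average of one column
```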


# Importing the module

Import the pandas package, along with NumPy:
```python
import pandas as pd
import numpy as np
```


# Data Structure - Series

A **Series** is a one-dimensional, homogeneous array with an index.

It is similar to a NumPy array.

For comparison, a plain Python list can be created as follows:


```python
obj = [4, 7, -5, 3]  # a plain Python list, not yet a Series
obj
```

A pandas Series can be created from the same values:

```python
obj = pd.Series([4,7,-5,3])
obj
```

By default, an integer index is added automatically (the index 0 to 3 appears in the previous output).
The values and the index can be shown as follows:

```python
obj.values
```


```python
obj.index
```

A custom index can be supplied to label each value:

```python
obj2 = pd.Series([4,7,-5,3],index=['d','b','a','c'])
obj2
```

We can then inspect all values and index labels using the `values` and `index` attributes:
```python
obj2.values
```

```python
obj2.index
```

To retrieve data, select from the Series using a single index label or a list of labels:

```python
obj2['a']
```

```python
obj2[['b','c']]
```
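
Besides square brackets, a Series can also be queried through the `loc` (label-based) and `iloc` (position-based) indexers. The sketch below rebuilds `obj2` so it runs on its own and contrasts the two; note that label slices include their end label, while positional slices do not.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

by_label = obj2.loc["a"]          # explicit label-based lookup
by_position = obj2.iloc[0]        # first element by position
label_slice = obj2.loc["b":"c"]   # label slices include the end label
position_slice = obj2.iloc[1:3]   # positional slices exclude the end position
```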

We can also create a Series from a dict:

```python
sdata = {'Ohio':3500, 'Texas':71000,'Oregon':16000, 'Utah':5000}
obj3 = pd.Series(sdata)
obj3
```

When the `index` parameter is given, only the dict entries whose keys match the index are kept, and index labels with no matching key get `NaN`:

```python
states = ['California','Ohio','Oregon','Texas']
obj4 = pd.Series(sdata,index = states)
obj4
```

`NaN` (not a number) marks an index label for which no data was provided; here, `California` has no entry in `sdata`.
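
Missing values can be detected and handled explicitly. The sketch below rebuilds `obj4` so it runs on its own; replacing `NaN` with `0` is only one possible choice and is an assumption of this example, not a recommendation.

```python
import pandas as pd

sdata = {"Ohio": 3500, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

missing_mask = obj4.isna()     # True where the value is NaN
present_only = obj4.dropna()   # drop the entries that have no data
filled = obj4.fillna(0)        # or substitute a default value
```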

We can filter the data by indexing with a boolean condition:
```python
obj4[obj4<20000]
```

To get the index labels of the filtered result, use the `index` attribute:
```python
obj4[obj4 <20000].index
```
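
It can help to look at the boolean mask itself before using it as a filter. The sketch below rebuilds `obj4`, shows the intermediate mask, and combines two conditions; the thresholds are arbitrary.

```python
import pandas as pd

sdata = {"Ohio": 3500, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

mask = obj4 < 20000   # boolean Series; a NaN entry compares as False
small = obj4[mask]    # keeps only the entries where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mid_range = obj4[(obj4 > 5000) & (obj4 < 20000)]
```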

## Task Data Structure - Series Hands-on
Create a Series of students using the student ID as the index and the name as the values. The names and student IDs are given below.


![image-20230806154629050](./assets/image-20230806154629050.png)

# Data Structure - DataFrame

A tabular, spreadsheet-like data structure.

Contains an ordered collection of columns.

Can be thought of as a dict of Series.
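
The "dict of Series" view can be made concrete with a short sketch. The two Series below hold made-up numbers and deliberately have different indexes, to show that each key becomes a column and that the rows are the union of the index labels.

```python
import pandas as pd

# Hypothetical Series with partly overlapping indexes (values are made up).
population = pd.Series({"Ohio": 1.0, "Texas": 2.0, "Utah": 3.0})
area = pd.Series({"Ohio": 10, "Texas": 20})

# Each dict key becomes a column; rows are the union of the Series indexes.
df_states = pd.DataFrame({"population": population, "area": area})

# Selecting a column returns a Series again.
pop_column = df_states["population"]
```

Because `area` has no entry for `Utah`, that cell is filled with `NaN`.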

A data frame is used to manipulate data, and the results produced by data science modules can be retrieved by extracting values from the DataFrame.
We can create a data frame as follows:

```python
data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year' :[2000  ,2001  ,2002  ,2001    ,2002],
        'pop'  :[1.5   ,1.7   ,3.6   ,2.4     ,2.9]}
frame = pd.DataFrame(data)
frame
```

We can also arrange the columns using the `columns` parameter:

```python
frame2 = pd.DataFrame(data, columns=['year','state','pop'])
frame2
```

The index can also be set. A column that is not in the data (here, `debt`) is filled with `NaN`:

```python
frame2 = pd.DataFrame(data, columns=['year','state','pop','debt'],
                      index=['one','two','three','four','five'])
frame2
```


We can extract a column either with dict-style indexing or as an attribute:

```python
frame2['state']
```

```python
frame2.year
```

We can retrieve a row by its index label using the `loc` indexer:

```python
frame2.loc['three']
```
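
`loc` also accepts a column label after the row label, and `iloc` does the same by position. The sketch below uses a separate, hypothetical name (`frame_demo`) with the same data so that it runs on its own without changing `frame2`.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame_demo = pd.DataFrame(data, columns=["year", "state", "pop"],
                          index=["one", "two", "three", "four", "five"])

row = frame_demo.loc["three"]           # one row, returned as a Series
cell = frame_demo.loc["three", "pop"]   # single value: row label, column label
subset = frame_demo.loc[["one", "two"], ["state", "year"]]  # several rows and columns
same_row = frame_demo.iloc[2]           # the third row, selected by position
```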

We can assign values to a column using a scalar or an array of values:

```python
frame2['debt'] = 16.5
frame2
```

```python
frame2['debt'] = np.arange(5)
frame2
```

Or we can assign a Series to the column. Values are aligned by index label, and rows with no matching label get `NaN`:

```python
val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2['debt'] = val
frame2
```

Summary statistics for the numeric columns of `frame` are available through the `describe` method:

```python
frame.describe()
```
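
The result of `describe` is itself a DataFrame, so individual statistics can be read with `loc`. The sketch below uses a hypothetical name (`frame_demo`) holding the same data as `frame`, and also shows the `include="all"` option, which adds non-numeric columns to the summary.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame_demo = pd.DataFrame(data)

summary = frame_demo.describe()         # numeric columns only: year and pop
mean_pop = summary.loc["mean", "pop"]   # one statistic for one column
max_year = summary.loc["max", "year"]

# include="all" also summarizes non-numeric columns such as state.
full_summary = frame_demo.describe(include="all")
top_state = full_summary.loc["top", "state"]   # most frequent value
```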

Another way to create a data frame is from a nested dict of dicts; the outer keys become the columns and the inner keys become the row index:

```python
pop = {'Nevada': {2001:2.4,2002:2.9},
       'Ohio'  : {2000:1.5,2001:1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3
```

`NaN` means the data was not provided.
A pandas DataFrame is a rectangular table: every column has an entry for every row, so combinations missing from the input are filled with `NaN`.

We can transpose the result


```python
frame3.T
```
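
As an alternative sketch, `pd.DataFrame.from_dict` with `orient="index"` builds the table with the outer keys as rows directly. It should hold the same values as the transpose above, although the column order may differ. The name `by_state` is introduced here only for illustration.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys (states) become the row index instead of the columns.
by_state = pd.DataFrame.from_dict(pop, orient="index")

# For comparison: build with outer keys as columns, then transpose.
transposed = pd.DataFrame(pop).T
```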

## Task Data Structure - Data frame

Starting from your previous work, add a midterm score column and an attendance score column for all the students.