# __Pandas__

pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the __Python__ programming language.

## __Features of pandas__

- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data

- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects

- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can ignore the labels and let Series, DataFrame, etc. align the data automatically in computations

- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data

- Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects

- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

- Intuitive merging and joining of data sets

- Flexible reshaping and pivoting of data sets

- Hierarchical labeling of axes (possible to have multiple labels per tick)

- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format

- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging.
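
As a small illustration of the automatic alignment and missing-data handling listed above, here is a minimal sketch. The two Series and their labels are made up for this example and are not part of the notebook's data.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative values).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on labels, not on positions.
# Labels present in only one Series ("a" and "d") produce NaN.
total = s1 + s2
print(total)
```

Only the labels `b` and `c` exist in both inputs, so only those two positions hold a number in the result.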



## __Data structures__

| Dimensions | Name      | Description                                                                                      |
| ---------- | --------- | ------------------------------------------------------------------------------------------------ |
| 1          | Series    | 1D labeled homogeneously-typed array                                                             |
| 2          | DataFrame | General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns |

In [1]:
import pandas as pd
import numpy as np
from config import MEDICAL_INSURANCE

## __Series__
__Creating 1D Array__

In [2]:
array1 = [2,5,4,6,7]
data = pd.Series(array1)
data

0    2
1    5
2    4
3    6
4    7
dtype: int64

In [3]:
type(data)

pandas.core.series.Series
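
The Series above uses the default integer index. A Series can also carry explicit labels; the sketch below assumes the hypothetical labels `a` to `e` purely for illustration.

```python
import pandas as pd

# Same values as array1, but with explicit (made-up) labels.
labeled = pd.Series([2, 5, 4, 6, 7], index=["a", "b", "c", "d", "e"])

by_label = labeled["c"]         # look up by label
by_position = labeled.iloc[2]   # look up by integer position
print(by_label, by_position)
```

Both lookups return the same element, because `c` is the third label.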

## __DataFrame__
__Creating 2D Array__

In [4]:
array2 = [[2,5,4,6,7], [12,59,45,61,45], [12,59,45,61,45], [12,59,45,61,45]]

pd.DataFrame(array2, columns=["test check", "B", "C", "D", "E"], index=["index-1", "index-2", "index-3", "index-4"])

Unnamed: 0,test check,B,C,D,E
index-1,2,5,4,6,7
index-2,12,59,45,61,45
index-3,12,59,45,61,45
index-4,12,59,45,61,45


In [5]:
df = pd.DataFrame(array2, columns=["test check", "B", "C", "D", "E"], index=["index-1", "index-2", "index-3", "index-4"])
type(df)

pandas.core.frame.DataFrame

In [6]:
df

Unnamed: 0,test check,B,C,D,E
index-1,2,5,4,6,7
index-2,12,59,45,61,45
index-3,12,59,45,61,45
index-4,12,59,45,61,45


## __df.iloc__

In [7]:
df.iloc[0::2]

Unnamed: 0,test check,B,C,D,E
index-1,2,5,4,6,7
index-3,12,59,45,61,45


## __df.loc__

In [8]:
df.loc[["index-2", "index-4"]]

Unnamed: 0,test check,B,C,D,E
index-2,12,59,45,61,45
index-4,12,59,45,61,45
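
The difference between the two indexers is easiest to see side by side. The following sketch rebuilds a two-column version of the frame above (the name `frame` is introduced here only for the example): `iloc` works on integer positions and excludes the end of a slice, while `loc` works on labels and includes both endpoints.

```python
import pandas as pd

frame = pd.DataFrame(
    [[2, 5], [12, 59], [12, 59], [12, 59]],
    columns=["test check", "B"],
    index=["index-1", "index-2", "index-3", "index-4"],
)

by_position = frame.iloc[0:2]               # positions 0 and 1; end is excluded
by_label = frame.loc["index-1":"index-2"]   # both endpoint labels are included
single_value = frame.loc["index-2", "B"]    # row label, then column label

print(by_position)
print(by_label)
print(single_value)
```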


## __df.describe()__

In [9]:
df.describe()

Unnamed: 0,test check,B,C,D,E
count,4.0,4.0,4.0,4.0,4.0
mean,9.5,45.5,34.75,47.25,35.5
std,5.0,27.0,20.5,27.5,19.0
min,2.0,5.0,4.0,6.0,7.0
25%,9.5,45.5,34.75,47.25,35.5
50%,12.0,59.0,45.0,61.0,45.0
75%,12.0,59.0,45.0,61.0,45.0
max,12.0,59.0,45.0,61.0,45.0


## __Manually Calculating Population std()__

- specific column name: __'test check'__

In [10]:
manual_variance_check = df.iloc[0:4, 0]
mvc_ls = manual_variance_check.tolist()
print(f"Converted to Python list:--> {mvc_ls}")

print(f"Mean Value of 'test check' column is:--> {np.sum(manual_variance_check)/ len(manual_variance_check)}")


Converted to Python list:--> [2, 12, 12, 12]
Mean Value of 'test check' column is:--> 9.5


### __Population std divides by $N$.__

- __std = square root of the variance of the data__

In [11]:
print(f"standard deviation of Population data is:-->\n{np.sqrt(np.var(mvc_ls))}")


standard deviation of Population data is:-->
4.330127018922194


- __std with built-in function__

In [12]:
population_std = np.std(mvc_ls)
print(f"standard deviation of Population data using numpy built-in function is:-->\n{population_std}")

standard deviation of Population data using numpy built-in function is:-->
4.330127018922194


## __Manually Calculating Sample std()__

- specific column name: __'test check'__

In [13]:
mvc_ls

[2, 12, 12, 12]

### __Sample standard deviation divides by $N-1$.__

- __Calculate the squared difference from the mean:__

In [14]:
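# Compute the mean once, then accumulate the squared deviations from it.
mean_value = np.sum(mvc_ls) / len(mvc_ls)
temp = 0
for i in mvc_ls:
    squared_diff = (i - mean_value) ** 2
    print(i, "==>", squared_diff)
    temp += squared_diff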

2 ==> 56.25
12 ==> 6.25
12 ==> 6.25
12 ==> 6.25


- __Sum the squared differences:__

In [15]:
temp

np.float64(75.0)

- __Divide by $N-1$ and take the square root:__

In [16]:
sample_std = np.sqrt(temp/ (len(mvc_ls)-1))
print(f"standard deviation of Sample data column 'test check' is:-->\n{sample_std}")

standard deviation of Sample data column 'test check' is:-->
5.0
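
The steps above correspond to the usual sample standard deviation formula. The symbols are chosen for this note and do not appear in the notebook: $x_i$ are the column values, $\bar{x}$ is their mean, and $N$ is the number of values.

$$
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
  = \sqrt{\frac{56.25 + 6.25 + 6.25 + 6.25}{4 - 1}}
  = \sqrt{\frac{75}{3}} = 5
$$

Dividing the same sum by $N = 4$ instead gives the population value, $\sqrt{75/4} \approx 4.33$.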


## __Final Outcome of STD__

In [17]:
print(f"{'*'*10} Population Standard Deviation {'*'*10}")
print(f"Population std of 'test check' is:\n{population_std}\n")

print(f"{'*'*10} Sample Standard Deviation {'*'*10}")
print(f"Sample std of 'test check' is:\n{sample_std}")

********** Population Standard Deviation **********
Population std of 'test check' is:
4.330127018922194

********** Sample Standard Deviation **********
Sample std of 'test check' is:
5.0
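
The two results differ only in the divisor, which pandas and NumPy expose through the `ddof` parameter. As a sketch with the same four values (the variable names below are introduced for this example): `Series.std()` defaults to `ddof=1`, which is why `df.describe()` reports 5.0, while `np.std()` defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 12, 12, 12], name="test check")

pandas_sample = values.std()            # pandas default: ddof=1, divide by N-1
pandas_population = values.std(ddof=0)  # divide by N
numpy_population = np.std(values.to_numpy())         # NumPy default: ddof=0
numpy_sample = np.std(values.to_numpy(), ddof=1)     # divide by N-1

print(pandas_sample, pandas_population)
print(numpy_sample, numpy_population)
```

Passing `ddof` explicitly in both libraries avoids relying on their different defaults.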


## __Reading CSV File__

In [18]:
df = pd.read_csv(MEDICAL_INSURANCE)
df

Unnamed: 0,age,gender,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


### __df.head(5)__

In [19]:
first_five_rows = df.head(5).copy()  # copy so later edits do not affect df
first_five_rows

Unnamed: 0,age,gender,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


### __np.nan__

In [20]:
first_five_rows.iloc[0::2, 0]

0    19
2    28
4    32
Name: age, dtype: int64

- __Setting the 1st, 3rd & 5th 'age' values to NaN__

In [21]:
# A single vectorized assignment is enough; no loop is needed.
first_five_rows.iloc[0::2, 0] = np.nan

- __Visualizing the DataFrame with updated result__

In [22]:
first_five_rows

Unnamed: 0,age,gender,bmi,children,smoker,region,charges
0,,female,27.9,0,yes,southwest,16884.924
1,18.0,male,33.77,1,no,southeast,1725.5523
2,,male,33.0,3,no,southeast,4449.462
3,33.0,male,22.705,0,no,northwest,21984.47061
4,,male,28.88,0,no,northwest,3866.8552
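
Once missing values are present, it is often useful to count and locate them. The sketch below rebuilds only the `age` and `bmi` columns shown above in a stand-in frame called `sample` (a name introduced for this example).

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "age": [np.nan, 18.0, np.nan, 33.0, np.nan],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Index labels of the rows where 'age' is missing.
rows_with_missing_age = sample[sample["age"].isna()].index.tolist()

print(missing_per_column)
print(rows_with_missing_age)
```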


### __df.to_csv__

- __Exporting new data CSV__

In [23]:
first_five_rows.to_csv("data/first5.csv")

### __pd.read_csv__

- __Reading CSV from path__

In [24]:
new_df_data = pd.read_csv("data/first5.csv")
new_df_data

Unnamed: 0.1,Unnamed: 0,age,gender,bmi,children,smoker,region,charges
0,0,,female,27.9,0,yes,southwest,16884.924
1,1,18.0,male,33.77,1,no,southeast,1725.5523
2,2,,male,33.0,3,no,southeast,4449.462
3,3,33.0,male,22.705,0,no,northwest,21984.47061
4,4,,male,28.88,0,no,northwest,3866.8552


### __Dropping the first column, named 'Unnamed: 0':__

In [25]:
new_df_data.drop(columns="Unnamed: 0", inplace=True)

In [26]:
new_df_data

Unnamed: 0,age,gender,bmi,children,smoker,region,charges
0,,female,27.9,0,yes,southwest,16884.924
1,18.0,male,33.77,1,no,southeast,1725.5523
2,,male,33.0,3,no,southeast,4449.462
3,33.0,male,22.705,0,no,northwest,21984.47061
4,,male,28.88,0,no,northwest,3866.8552
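
The extra `Unnamed: 0` column appears because `to_csv` writes the index by default and `read_csv` then reads it back as an ordinary column. Two common ways to avoid the manual drop are sketched below. An in-memory `io.StringIO` buffer stands in for the file on disk, and the small frame is made up for the example.

```python
import io
import pandas as pd

small = pd.DataFrame({"age": [18, 33], "gender": ["male", "male"]})

# Option 1: do not write the index at all.
buffer = io.StringIO()
small.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Option 2: write the index, then read the first column back as the index.
buffer_with_index = io.StringIO()
small.to_csv(buffer_with_index)
buffer_with_index.seek(0)
round_trip_indexed = pd.read_csv(buffer_with_index, index_col=0)

print(round_trip.columns.tolist())
print(round_trip_indexed.columns.tolist())
```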


### __Dropping rows that contain NaN values__

In [27]:
cleaned_data = new_df_data.dropna(axis=0)
cleaned_data

Unnamed: 0,age,gender,bmi,children,smoker,region,charges
1,18.0,male,33.77,1,no,southeast,1725.5523
3,33.0,male,22.705,0,no,northwest,21984.47061
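
`dropna` can be narrowed or redirected with its `subset` and `axis` parameters. The following sketch uses a small made-up frame (`sample`) with gaps in two columns to show how the result changes.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "age": [np.nan, 18.0, np.nan],
    "bmi": [27.9, np.nan, 33.0],
    "region": ["southwest", "southeast", "southeast"],
})

any_missing = sample.dropna()              # drop rows with a NaN in any column
age_only = sample.dropna(subset=["age"])   # only consider NaN in 'age'
columns_dropped = sample.dropna(axis=1)    # drop columns that contain NaN

print(len(any_missing))
print(age_only.index.tolist())
print(columns_dropped.columns.tolist())
```

Here every row has at least one gap, so the default call removes all of them, while `subset=["age"]` keeps the one row whose age is known.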


### __df.fillna()__

In [28]:
new_df_data.fillna("ZERO")

Unnamed: 0,age,gender,bmi,children,smoker,region,charges
0,ZERO,female,27.9,0,yes,southwest,16884.924
1,18.0,male,33.77,1,no,southeast,1725.5523
2,ZERO,male,33.0,3,no,southeast,4449.462
3,33.0,male,22.705,0,no,northwest,21984.47061
4,ZERO,male,28.88,0,no,northwest,3866.8552
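
Filling a numeric column with a string such as `"ZERO"` leaves it holding a mix of strings and numbers, so numeric operations on it are no longer straightforward. A numeric fill value keeps the column usable for arithmetic. As a sketch, using only the `age` values from above:

```python
import numpy as np
import pandas as pd

ages = pd.Series([np.nan, 18.0, np.nan, 33.0, np.nan], name="age")

filled_zero = ages.fillna(0)             # replace NaN with the number 0
filled_mean = ages.fillna(ages.mean())   # mean() skips NaN: (18 + 33) / 2

print(filled_zero.tolist())
print(filled_mean.tolist())
```

Which fill value is appropriate depends on the analysis; the mean is shown here only as one common choice.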


### __df.groupby()__

In [29]:
grouped_data = new_df_data.groupby(["age", "gender", "smoker"])
pd.DataFrame(grouped_data)[0]

0    (18.0, male, no)
1    (33.0, male, no)
Name: 0, dtype: object
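
Only two groups appear above because `groupby` drops rows whose key is NaN by default. The sketch below, on a made-up frame with the same `age` gaps, shows the `dropna=False` option and a typical aggregation; the column selection and the `mean` aggregation are choices made for this example.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "age": [np.nan, 18.0, np.nan, 33.0, np.nan],
    "gender": ["female", "male", "male", "male", "male"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Split by gender, then aggregate one column per group.
mean_by_gender = sample.groupby("gender")["charges"].mean()

# Rows with a NaN key are excluded unless dropna=False is passed.
groups_default = sample.groupby("age").size()
groups_with_nan = sample.groupby("age", dropna=False).size()

print(mean_by_gender)
print(groups_default)
print(groups_with_nan)
```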