# What is pandas?

## Overview

### Objectives

+ Know why pandas is suitable for data analysis in Python
+ Identify a DataFrame as 

### Resources

+ [Official Documentation](http://pandas.pydata.org/pandas-docs/stable/)
+ [Package Overview](http://pandas.pydata.org/pandas-docs/stable/overview.html)
+ [Intro to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)

## Welcome to ....
![][1]


### What is pandas?
pandas is one of the most popular open source data exploration libraries currently available. It gives its users the power to explore, manipulate, query, aggregate, and visualize **tabular** data. Tabular data is two-dimensional, with rows and columns; i.e., a table.

### Why pandas ?
In the current age of data explosion, dozens of other tools offer many of the same capabilities as the pandas library. However, several aspects of pandas make it an attractive choice for data analysis, and it continues to have one of the fastest-growing user bases.

* It's a Python library and integrates well with the other popular data science libraries such as numpy, scikit-learn, statsmodels, matplotlib and seaborn.
* It is nearly self-contained in that lots of functionality is built into one package. This contrasts with R, where many packages are needed to obtain the same functionality.
* The community is excellent. Looking at Stack Overflow, for example, there are [tens of thousands of][2] pandas questions. If you need help, you are very likely to find it quickly.

### Why is it named after an East Asian bear?

The pandas library was started by Wes McKinney in 2008 at a hedge fund named AQR. In finance, tabular data is often called 'panel data', which, smashed together, becomes pandas. If you are really interested in the history, you can hear it from the creator [himself][3].

### Python already has data structures to handle data, why do we need another one?

Even though Python is a high-level language, its primary built-in data structures, lists and dictionaries, do not lend themselves easily to the kind of tabular data analysis that people want to perform.
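
To make the point concrete, here is a minimal sketch that answers the same question (the average price) first with built-in structures and then with pandas. The records and the names `records`, `mean_price_python`, and `mean_price_pandas` are made up for illustration.

```python
import pandas as pd

# Hypothetical records stored the plain-Python way: a list of dictionaries.
records = [
    {"model": "A", "price": 1000, "items": 2},
    {"model": "B", "price": 2000, "items": 3},
    {"model": "C", "price": 300, "items": 4},
]

# With built-ins, every row has to be visited by hand.
mean_price_python = sum(row["price"] for row in records) / len(records)

# With pandas, the same question is one column selection and one method call.
df_records = pd.DataFrame(records)
mean_price_pandas = df_records["price"].mean()

print(mean_price_python, mean_price_pandas)
```

Both approaches give the same number; the difference is that the pandas version stays this short as the questions get more involved (filtering, grouping, joining).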

### pandas is built directly on numpy

[numpy][4] ('numerical Python') is the most popular third-party Python library for scientific computing and forms the foundation for dozens of others, including pandas. numpy's primary data structure is an n-dimensional array, which is much more powerful than a Python list and offers much better performance.

By default, most of the data in pandas is stored in numpy arrays. That said, it isn't necessary to know much about numpy when learning pandas. You can think of pandas as a higher-level, easier-to-use interface for doing data analysis than numpy. It is a good idea to eventually learn numpy, but for most tasks, pandas will be the right tool.
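
The relationship between the two libraries can be sketched as follows. The Series `prices` is a made-up example; `to_numpy` and `np.log10` are existing pandas and numpy calls.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1000, 2000, 300], name="price")

# to_numpy() exposes the values of a Series as a numpy array.
arr = prices.to_numpy()
print(type(arr))

# numpy functions accept pandas objects directly and keep the pandas labels.
log_prices = np.log10(prices)
print(type(log_prices))
```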


## pandas operates on tabular (table) data

There are numerous formats for data, such as XML, JSON, raw bytes, and many others. For our purposes, however, we will only examine what most people think of when they think of data: a table. pandas is built for analyzing this tabular, rectangular, deceptively normal form of data. pandas can read many different data formats, but all of them are converted to tabular data.
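
As a small sketch of that last point, the same two made-up records are read below from CSV text and from JSON text. The strings `csv_text` and `json_text` are illustrative only; in practice the input would usually be a file path.

```python
import io

import pandas as pd

# The same hypothetical records in two different text formats.
csv_text = "model,price\nA,1000\nB,2000\n"
json_text = '[{"model": "A", "price": 1000}, {"model": "B", "price": 2000}]'

# Both readers return a DataFrame, whatever the source format was.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```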

### The DataFrame and Series

The DataFrame and Series are the two primary pandas objects that we will be using throughout this course.

* **DataFrame** - A two-dimensional data structure that looks like any other rectangular table of data you have seen with rows and columns.
* **Series** - A single dimension of data. It is analogous to a single column of data or a one-dimensional array.
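
A short sketch of how the two objects relate, using a made-up frame named `df_demo`: selecting one column with single brackets gives a Series, while selecting a list of columns gives a DataFrame.

```python
import pandas as pd

df_demo = pd.DataFrame({"Model": ["A", "B"], "Price": [1000, 2000]})

price = df_demo["Price"]          # single brackets: a one-dimensional Series
price_frame = df_demo[["Price"]]  # list of columns: a two-dimensional DataFrame

print(type(price).__name__, price.ndim)
print(type(price_frame).__name__, price_frame.ndim)
```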

[1]: images/pandas_logo.png
[2]: http://stackoverflow.com/questions/tagged/pandas
[3]: https://www.youtube.com/watch?v=kHdkFyGCxiY
[4]: http://www.numpy.org/

## pandas examples

* Reading data
* Filtering data
* Aggregating methods
* Non-Aggregating methods
* Aggregating within groups
* Tidying data
* Joining data
* Time series analysis
* Visualization

### The `head` method

You will notice that many of the last lines of code end with the `head` method. This returns, by default, the first five rows. This helps keep the output compact.
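
A quick illustration of `head`, assuming a made-up 100-row frame named `numbers`. Passing an integer changes how many rows come back, and `tail` does the same from the end.

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(100)})

first_five = numbers.head()    # default: the first five rows
first_three = numbers.head(3)  # the first three rows
last_two = numbers.tail(2)     # the last two rows

print(first_five.shape, first_three.shape, last_two.shape)
```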

In [199]:
import pandas as pd
import numpy as np

In [200]:
# A pandas Series with the default integer index
s = pd.Series([10, 12, 13, 14, 15])
s

0    10
1    12
2    13
3    14
4    15
dtype: int64

In [201]:
# A pandas Series with custom string labels as its index
s = pd.Series([10, 12, 13, 14, 15], index=['a', 'b', 'c', 'd', 'e'])
s

a    10
b    12
c    13
d    14
e    15
dtype: int64

In [202]:
dict1 = {'Model':['A','B','C','D','E'],'Price': [1000,2000,300,4000,500],'Items':[2,3,4,5,9]}
df = pd.DataFrame(dict1)
df

,Model,Price,Items
0,A,1000,2
1,B,2000,3
2,C,300,4
3,D,4000,5
4,E,500,9


In [203]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Model   5 non-null      object
 1   Price   5 non-null      int64 
 2   Items   5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes


In [204]:
df['Total'] = df['Price']* df['Items']  # Create new col
df

,Model,Price,Items,Total
0,A,1000,2,2000
1,B,2000,3,6000
2,C,300,4,1200
3,D,4000,5,20000
4,E,500,9,4500


In [205]:
# drop cols
df.drop('Total',axis=1,inplace=True)
# df = df.drop('Total',axis=1)

In [206]:
df

,Model,Price,Items
0,A,1000,2
1,B,2000,3
2,C,300,4
3,D,4000,5
4,E,500,9


In [207]:
df.set_index('Model',inplace=True)

In [208]:
df['Total'] = df['Price']* df['Items']
df

Model,Price,Items,Total
A,1000,2,2000
B,2000,3,6000
C,300,4,1200
D,4000,5,20000
E,500,9,4500


In [None]:
# df.loc[row_label]  -> select rows by index label
# df.iloc[position]  -> select rows by integer position

In [209]:
df.loc['A'] # rowlabel

Price    1000
Items       2
Total    2000
Name: A, dtype: int64

In [210]:
df.loc['A',['Items','Total']] # subset of row and cols -labels

Items       2
Total    2000
Name: A, dtype: int64

In [211]:
# C and D rows + Price and Total
df.loc[['C','D'],['Price','Total']]

Model,Price,Total
C,300,1200
D,4000,20000


In [24]:
df.loc[:,['Price','Items']] # All rows and Cols-Price and Items

Model,Price,Items
A,1000,2
B,2000,3
C,300,4
D,4000,5
E,500,9


In [212]:
df.iloc[[0,1]]  # position-based: the first two rows (A and B)

Model,Price,Items,Total
A,1000,2,2000
B,2000,3,6000


In [213]:
df.iloc[0:2]

Model,Price,Items,Total
A,1000,2,2000
B,2000,3,6000


In [214]:
# extract all rows of column Price using iloc
df.iloc[:,[0]]

Model,Price
A,1000
B,2000
C,300
D,4000
E,500


In [215]:
# Extract all rows except the last one (with all columns) using iloc
df.iloc[:-1]

Model,Price,Items,Total
A,1000,2,2000
B,2000,3,6000
C,300,4,1200
D,4000,5,20000
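
One detail of the `loc` and `iloc` cells above is easy to miss: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position. The sketch below uses a made-up frame named `df_slice` and also shows how a negative stop on the column axis drops the last column.

```python
import pandas as pd

df_slice = pd.DataFrame(
    {"Price": [1000, 2000, 300, 4000, 500], "Items": [2, 3, 4, 5, 9]},
    index=["A", "B", "C", "D", "E"],
)

by_label = df_slice.loc["A":"C"]     # label slice: A, B and C (end included)
by_position = df_slice.iloc[0:2]     # position slice: rows 0 and 1 (end excluded)
no_last_row = df_slice.iloc[:-1]     # every row except the last
no_last_col = df_slice.iloc[:, :-1]  # every column except the last

print(by_label.index.tolist(), by_position.index.tolist())
```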


## Components of a DataFrame - columns, index, and data
The DataFrame is composed of three separate components that you must know: the **columns**, the **index**, and the **data**. Look at the following graphic of our `bikes` DataFrame, stylized to put emphasis on each component.

![][5]

[5]: images/df_components.png

* The **index** provides a label for each row
* The **columns** provide a label for each column
* The **index** is also referred to as the **row names/labels**
* The **columns** are also referred to as the **column names/labels** or the **column index**
* An individual element of the index is referred to as an **index label/name** or **row label/name**
* An individual element of the columns is a **column name/label**
* The index and the columns are always in **bold font**
* Collectively the index and the columns are known as the **axes** (or individually as an **axis**)
* pandas uses integers to refer to each axis: 0 for the index and 1 for the columns. This is borrowed directly from numpy.
* The actual **data** is always in normal font
* The **data** is also referred to as the **values**
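
The three components and the axis numbers can be inspected directly. This is a minimal sketch on a made-up frame named `df_parts`, not on the `bikes` data shown in the graphic.

```python
import pandas as pd

df_parts = pd.DataFrame(
    {"Price": [1000, 2000, 300], "Items": [2, 3, 4]},
    index=["A", "B", "C"],
)

print(df_parts.index)       # the index: one label per row
print(df_parts.columns)     # the columns: one label per column
print(df_parts.to_numpy())  # the data (values) as a numpy array

# axis=0 works down the index and gives one result per column;
# axis=1 works across the columns and gives one result per row.
col_totals = df_parts.sum(axis=0)
row_totals = df_parts.sum(axis=1)
```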

In [None]:
# Pandas Series

In [None]:
# Creating dataframes

In [216]:
df1 = pd.DataFrame()
states = ['Kerala','karnataka','Gujarat','Andhra Pradesh','Assam']
df1['states'] = states
df1['pincode'] = [560040,534001,500010,781021,500010]
df1

,states,pincode
0,Kerala,560040
1,karnataka,534001
2,Gujarat,500010
3,Andhra Pradesh,781021
4,Assam,500010


In [217]:
df1.set_index('states',inplace=True)
df1

states,pincode
Kerala,560040
karnataka,534001
Gujarat,500010
Andhra Pradesh,781021
Assam,500010


In [218]:
df1.reset_index(inplace=True)

In [219]:
df1

,states,pincode
0,Kerala,560040
1,karnataka,534001
2,Gujarat,500010
3,Andhra Pradesh,781021
4,Assam,500010


In [220]:
df1.to_csv("sample.csv")

In [221]:
import os
os.getcwd()

'D:\\DELOITTE\\DELOITTE_BATCH2\\Python\\batch-B'

In [222]:
list1 = [1,np.nan,3,4,10]
list2 = [np.nan,5,6,7,8]
dfnew = pd.DataFrame(zip(list1,list2),columns=['A','B'])
dfnew

,A,B
0,1.0,
1,,5.0
2,3.0,6.0
3,4.0,7.0
4,10.0,8.0


In [223]:
dfnew.isna()

,A,B
0,False,True
1,True,False
2,False,False
3,False,False
4,False,False


In [224]:
dfnew.isna().sum()

A    1
B    1
dtype: int64

In [225]:
df = pd.read_csv("occupation.csv",sep="|")

In [226]:
df.head()

,user_id,age,gender,occupation,zipcode
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [None]:
# how to access specific columns from dataframe

In [227]:
df.head(10)

,user_id,age,gender,occupation,zipcode
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


In [228]:
df.tail(10)

,user_id,age,gender,occupation,zipcode
933,934,61,M,engineer,22902
934,935,42,M,doctor,66221
935,936,24,M,other,32789
936,937,48,M,educator,98072
937,938,38,F,technician,55038
938,939,26,F,student,33319
939,940,32,M,administrator,2215
940,941,20,M,student,97229
941,942,48,F,librarian,78209
942,943,22,M,student,77841


In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   gender      943 non-null    object
 3   occupation  943 non-null    object
 4   zipcode     943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB


In [229]:
df.columns

Index(['user_id', 'age', 'gender', 'occupation', 'zipcode'], dtype='object')

In [230]:
df['gender'].value_counts()

M    670
F    273
Name: gender, dtype: int64

In [231]:
df.dtypes

user_id        int64
age            int64
gender        object
occupation    object
zipcode       object
dtype: object

In [232]:
df.isnull().sum()

user_id       0
age           0
gender        0
occupation    0
zipcode       0
dtype: int64

In [233]:
# How many different occupations are available in this dataset?

In [234]:
df['occupation'].nunique()

21

In [235]:
# Find the count of users for each occupation.

In [236]:
df['occupation'].value_counts()

student          196
other            105
educator          95
administrator     79
engineer          67
programmer        66
librarian         51
writer            45
executive         32
scientist         31
artist            28
technician        27
marketing         26
entertainment     18
healthcare        16
retired           14
lawyer            12
salesman          12
none               9
doctor             7
homemaker          7
Name: occupation, dtype: int64

In [None]:
# Which occupation has the most entries?

In [237]:
df['occupation'].value_counts().sort_values(ascending=False).head(1)

student    196
Name: occupation, dtype: int64

In [None]:
# What is the average age of users?

In [238]:
int(df['age'].mean()) #integer

34

In [239]:
round(df['age'].mean(),2) #round off

34.05

In [240]:
# What is the most frequently occurring age?

In [241]:
df['age'].value_counts().sort_values(ascending=False).head(1) # most frequent age

30    39
Name: age, dtype: int64

In [None]:
# What is the minimum and maximum age of the user?

In [242]:
df['age'].min() #min age

7

In [243]:
df['age'].max() #max age

73

In [None]:
# Display all records for users whose age is strictly between 20 and 50 (both bounds excluded)

In [244]:
df[(df['age']>20) & (df['age']< 50)] # 709 records

,user_id,age,gender,occupation,zipcode
0,1,24,M,technician,85711
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
...,...,...,...,...,...
937,938,38,F,technician,55038
938,939,26,F,student,33319
939,940,32,M,administrator,02215
941,942,48,F,librarian,78209


In [245]:
df[(df['age']>20) & (df['age']< 50)].shape[0] #Number of records- age> 20 and age< 50

709

In [None]:
# set the user_id as index column 

In [246]:
df.set_index('user_id',inplace=True)

In [247]:
df.head()

user_id,age,gender,occupation,zipcode
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [None]:
#drop user_id column

In [248]:
df.reset_index(inplace=True)

In [250]:
df.drop(['user_id'],axis=1,inplace=True)

In [251]:
df.head()

,age,gender,occupation,zipcode
0,24,M,technician,85711
1,53,F,other,94043
2,23,M,writer,32067
3,24,M,technician,43537
4,33,F,other,15213


In [None]:
# Display the records from the 4th row through the 9th row (positions 3 to 8)

In [252]:
df[3:9]

,age,gender,occupation,zipcode
3,24,M,technician,43537
4,33,F,other,15213
5,42,M,executive,98101
6,57,M,administrator,91344
7,36,M,administrator,5201
8,29,M,student,1002


In [None]:
# How many users belong to the occupation writer ?

In [94]:
df['occupation'].value_counts()

student          196
other            105
educator          95
administrator     79
engineer          67
programmer        66
librarian         51
writer            45
executive         32
scientist         31
artist            28
technician        27
marketing         26
entertainment     18
healthcare        16
retired           14
lawyer            12
salesman          12
none               9
doctor             7
homemaker          7
Name: occupation, dtype: int64

In [253]:
df.loc[df['occupation'] == 'writer'].shape[0]

45

In [254]:
x=df.loc[df['occupation'] == 'writer','occupation'].count()
x

45

In [None]:
# How many users do not belong to the occupation writer?

In [255]:
df[df['occupation']!='writer'].shape[0]

898

In [None]:
# How many male and female users?

In [102]:
df['gender'].value_counts()

M    670
F    273
Name: gender, dtype: int64

In [None]:
# How to apply a function to a column, and how to create a new column?

In [256]:
df['gender_no'] = df['gender'].replace(['M','F'],['1','0'])
df

,age,gender,occupation,zipcode,gender_no
0,24,M,technician,85711,1
1,53,F,other,94043,0
2,23,M,writer,32067,1
3,24,M,technician,43537,1
4,33,F,other,15213,0
...,...,...,...,...,...
938,26,F,student,33319,0
939,32,M,administrator,02215,1
940,20,M,student,97229,1
941,48,F,librarian,78209,0


In [None]:
# create a function
# apply the function to the column and create a new column 'gender_number'


In [257]:
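# Recode gender: M -> 0, F -> 1 (note: the opposite of gender_no above).
# Assigning the result back avoids relying on inplace=True on a selected column.
df['gender'] = df['gender'].replace({'M': 0, 'F': 1})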


In [None]:
# A lambda is a small anonymous function; this one adds its two arguments
lambda x, y: x + y

In [258]:
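# Note: 'gender' was already recoded to 0/1 in the cell above, so x == 'M' is
# never true here and every row gets 0 (see the output below)
df['gender_number'] = df['gender'].apply(lambda x: 1 if x == 'M' else 0)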
df.head()

,age,gender,occupation,zipcode,gender_no,gender_number
0,24,0,technician,85711,1,0
1,53,1,other,94043,0,0
2,23,0,writer,32067,1,0
3,24,0,technician,43537,1,0
4,33,1,other,15213,0,0
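
The all-zero `gender_number` column above comes from the order of the cells: by the time `apply` runs, `gender` no longer contains `'M'` or `'F'`. A minimal sketch on a made-up frame named `users` shows both the intended recoding (here with `Series.map` and a dictionary, one of several ways to do it) and the pitfall.

```python
import pandas as pd

users = pd.DataFrame({"gender": ["M", "F", "M"]})

# Intended result: recode the original labels with a dictionary lookup.
users["gender_number"] = users["gender"].map({"M": 1, "F": 0})

# Pitfall: once the labels have been recoded to numbers,
# a comparison against 'M' can never match, so every value becomes 0.
recoded = users["gender"].map({"M": 0, "F": 1})
all_zero = recoded.apply(lambda x: 1 if x == "M" else 0)

print(users["gender_number"].tolist(), all_zero.tolist())
```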



In [None]:
# Change M to value 1 and change F to value 0

In [None]:
# What is the mean age per occupation?

In [259]:
round(df.groupby('occupation')['age'].mean(),2)

occupation
administrator    38.75
artist           31.39
doctor           43.57
educator         42.01
engineer         36.39
entertainment    29.22
executive        38.72
healthcare       41.56
homemaker        32.57
lawyer           36.75
librarian        40.00
marketing        37.62
none             26.56
other            34.52
programmer       33.12
retired          63.07
salesman         35.67
scientist        35.55
student          22.08
technician       33.15
writer           36.31
Name: age, dtype: float64

In [None]:
# Find the minimum and maximum age of the users for each occupation

In [260]:
df.groupby('occupation')['age'].min()

occupation
administrator    21
artist           19
doctor           28
educator         23
engineer         22
entertainment    15
executive        22
healthcare       22
homemaker        20
lawyer           21
librarian        23
marketing        24
none             11
other            13
programmer       20
retired          51
salesman         18
scientist        23
student           7
technician       21
writer           18
Name: age, dtype: int64

In [261]:
df.groupby('occupation')['age'].max()

occupation
administrator    70
artist           48
doctor           64
educator         63
engineer         70
entertainment    50
executive        69
healthcare       62
homemaker        50
lawyer           53
librarian        69
marketing        55
none             55
other            64
programmer       63
retired          73
salesman         66
scientist        55
student          42
technician       55
writer           60
Name: age, dtype: int64

In [262]:
df.groupby(['occupation','gender'])['age'].agg(['min','max','mean'])

occupation,gender,min,max,mean
administrator,0,21,70,37.162791
administrator,1,22,62,40.638889
artist,0,20,45,32.333333
artist,1,19,48,30.307692
doctor,0,28,64,43.571429
educator,0,25,63,43.101449
educator,1,23,51,39.115385
engineer,0,22,70,36.6
engineer,1,23,36,29.5
entertainment,0,15,50,29.0


In [263]:
df.groupby(['occupation','gender'])['gender'].agg(['count'])

occupation,gender,count
administrator,0,43
administrator,1,36
artist,0,15
artist,1,13
doctor,0,7
educator,0,69
educator,1,26
engineer,0,65
engineer,1,2
entertainment,0,16


In [None]:
# For each combination of occupation and gender,calculate the mean age

# administrator: mean age of females and mean age of males

In [264]:
df.groupby(['occupation','gender'])['age'].mean()

occupation     gender
administrator  0         37.162791
               1         40.638889
artist         0         32.333333
               1         30.307692
doctor         0         43.571429
educator       0         43.101449
               1         39.115385
engineer       0         36.600000
               1         29.500000
entertainment  0         29.000000
               1         31.000000
executive      0         38.172414
               1         44.000000
healthcare     0         45.400000
               1         39.818182
homemaker      0         23.000000
               1         34.166667
lawyer         0         36.200000
               1         39.500000
librarian      0         40.000000
               1         40.000000
marketing      0         37.875000
               1         37.200000
none           0         18.600000
               1         36.500000
other          0         34.028986
               1         35.472222
programmer     0         33.21666

In [265]:
df.groupby(['occupation','gender']).agg({'age':['mean']})

,,age
occupation,gender,mean
administrator,0,37.162791
administrator,1,40.638889
artist,0,32.333333
artist,1,30.307692
doctor,0,43.571429
educator,0,43.101449
educator,1,39.115385
engineer,0,36.6
engineer,1,29.5
entertainment,0,29.0


In [266]:
df.groupby(['occupation','gender'])['age'].agg(['mean'])

occupation,gender,mean
administrator,0,37.162791
administrator,1,40.638889
artist,0,32.333333
artist,1,30.307692
doctor,0,43.571429
educator,0,43.101449
educator,1,39.115385
engineer,0,36.6
engineer,1,29.5
entertainment,0,29.0
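
The grouped aggregations above can also be written with keyword arguments, which lets each result column get a name of its own. This is a sketch on a made-up frame named `users`; the column names `min_age`, `max_age`, and `mean_age` are illustrative choices.

```python
import pandas as pd

users = pd.DataFrame({
    "occupation": ["student", "student", "writer", "writer"],
    "gender": ["M", "F", "M", "M"],
    "age": [20, 24, 30, 40],
})

# One row per (occupation, gender) pair, with a named column per statistic.
summary = (
    users.groupby(["occupation", "gender"])["age"]
    .agg(min_age="min", max_age="max", mean_age="mean")
    .reset_index()  # turn the group keys back into ordinary columns
)

print(summary)
```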


In [None]:
# Count of males and females for each occupation

In [267]:
df.groupby(['occupation','gender'])['gender'].count()

occupation     gender
administrator  0          43
               1          36
artist         0          15
               1          13
doctor         0           7
educator       0          69
               1          26
engineer       0          65
               1           2
entertainment  0          16
               1           2
executive      0          29
               1           3
healthcare     0           5
               1          11
homemaker      0           1
               1           6
lawyer         0          10
               1           2
librarian      0          22
               1          29
marketing      0          16
               1          10
none           0           5
               1           4
other          0          69
               1          36
programmer     0          60
               1           6
retired        0          13
               1           1
salesman       0           9
               1           3
scientist      0     

In [124]:
df.head()

,age,gender,occupation,zipcode,gender_no,gender_number
0,24,M,technician,85711,1,1
1,53,F,other,94043,0,0
2,23,M,writer,32067,1,1
3,24,M,technician,43537,1,1
4,33,F,other,15213,0,0


In [None]:
# Convert all records in occupation column to uppercase

In [269]:
df['occupation']=df['occupation'].str.upper()
df

,age,gender,occupation,zipcode,gender_no,gender_number
0,24,0,TECHNICIAN,85711,1,0
1,53,1,OTHER,94043,0,0
2,23,0,WRITER,32067,1,0
3,24,0,TECHNICIAN,43537,1,0
4,33,1,OTHER,15213,0,0
...,...,...,...,...,...,...
938,26,1,STUDENT,33319,0,0
939,32,0,ADMINISTRATOR,02215,1,0
940,20,0,STUDENT,97229,1,0
941,48,1,LIBRARIAN,78209,0,0


In [None]:
# rename the columns age to current_age and occupation to current_occupation

In [270]:
df.rename(columns={'age':'current age','occupation':'current occupation'},inplace=True)
#df = df.rename(columns={'age':'current age','occupation':'current occupation'})

In [None]:
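# Alternative with underscores instead of spaces. Run this instead of the cell above:
# rename silently ignores keys that are no longer column names.
df.rename(columns={'age': 'current_age', 'occupation': 'current_occupation'}, inplace=True)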

In [139]:
df

,current age,gender,current occupation,zipcode,gender_no,gender_number
0,24,M,TECHNICIAN,85711,1,1
1,53,F,OTHER,94043,0,0
2,23,M,WRITER,32067,1,1
3,24,M,TECHNICIAN,43537,1,1
4,33,F,OTHER,15213,0,0
...,...,...,...,...,...,...
938,26,F,STUDENT,33319,0,0
939,32,M,ADMINISTRATOR,02215,1,1
940,20,M,STUDENT,97229,1,1
941,48,F,LIBRARIAN,78209,0,0


In [143]:
df.columns=['current age','gender','current occu','zipcode','gender_no','gender_numb']
df

,current age,gender,current occu,zipcode,gender_no,gender_numb
0,24,M,TECHNICIAN,85711,1,1
1,53,F,OTHER,94043,0,0
2,23,M,WRITER,32067,1,1
3,24,M,TECHNICIAN,43537,1,1
4,33,F,OTHER,15213,0,0
...,...,...,...,...,...,...
938,26,F,STUDENT,33319,0,0
939,32,M,ADMINISTRATOR,02215,1,1
940,20,M,STUDENT,97229,1,1
941,48,F,LIBRARIAN,78209,0,0


In [271]:
df.drop(['gender','gender_no'],axis=1,inplace= True)
df

,current age,current occupation,zipcode,gender_number
0,24,TECHNICIAN,85711,0
1,53,OTHER,94043,0
2,23,WRITER,32067,0
3,24,TECHNICIAN,43537,0
4,33,OTHER,15213,0
...,...,...,...,...
938,26,STUDENT,33319,0
939,32,ADMINISTRATOR,02215,0
940,20,STUDENT,97229,0
941,48,LIBRARIAN,78209,0


In [272]:
data = pd.read_csv("loan_data_set.csv")
# Use pd.read_excel for Excel files such as data.xlsx
data.head()

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [273]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [274]:
data.shape

(614, 13)

In [275]:
data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [148]:
data.dropna() # returns a new DataFrame with the 480 complete rows; data itself is unchanged

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [155]:
data['LoanAmount'].mean()

146.41216216216216

In [276]:
# Fill missing loan amounts with the column mean and assign the result back to the column
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].mean())
data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [277]:
data['Gender'].value_counts()

Male      489
Female    112
Name: Gender, dtype: int64

In [278]:
# Fill missing genders with the most frequent value ('Male') and assign the result back
data['Gender'] = data['Gender'].fillna('Male')
data.isnull().sum()

Loan_ID               0
Gender                0
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
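
The two fills above follow a common pattern: a numeric column gets its mean, and a categorical column gets its most frequent value. The sketch below repeats that pattern on a made-up frame named `loans` and looks the most frequent value up with `mode()` instead of typing it in by hand.

```python
import numpy as np
import pandas as pd

loans = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "LoanAmount": [100.0, np.nan, 200.0, 300.0],
})

# Numeric column: fill gaps with the mean of the values that are present.
loans["LoanAmount"] = loans["LoanAmount"].fillna(loans["LoanAmount"].mean())

# Categorical column: fill gaps with the most frequent value.
most_common = loans["Gender"].mode()[0]
loans["Gender"] = loans["Gender"].fillna(most_common)

print(loans.isnull().sum())
```

Whether the mean or the most frequent value is an appropriate fill depends on the data; this only shows the mechanics.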

In [279]:
from datetime import datetime

In [280]:
dateyr = {'dates':['20220102','20220103','20220121'],'classes':['a','b','c']}
data = pd.DataFrame(dateyr)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   dates    3 non-null      object
 1   classes  3 non-null      object
dtypes: object(2)
memory usage: 176.0+ bytes


In [281]:
data['dates'] = pd.to_datetime(data['dates'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   dates    3 non-null      datetime64[ns]
 1   classes  3 non-null      object        
dtypes: datetime64[ns](1), object(1)
memory usage: 176.0+ bytes


In [282]:
data['dates'].dt.day

0     2
1     3
2    21
Name: dates, dtype: int64

In [283]:
data['dates'].dt.month

0    1
1    1
2    1
Name: dates, dtype: int64

In [284]:
data['dates'].dt.year

0    2022
1    2022
2    2022
Name: dates, dtype: int64

In [285]:
d1 = datetime.now()
d1

datetime.datetime(2022, 1, 31, 18, 27, 7, 341368)

In [286]:
d1.weekday()

0

In [287]:
d1.ctime() # convert to string format


'Mon Jan 31 18:27:07 2022'

In [288]:
#extract date
d1.ctime()[:10]

'Mon Jan 31'

In [289]:
#timedelta
d1 + pd.to_timedelta(2,unit='d') 

datetime.datetime(2022, 2, 2, 18, 27, 7, 341368)

In [290]:
# create dataframe
# 1st col = startdates
# 2nd col = enddates
# 3rd col = duration

In [291]:
start = pd.date_range('2021-11-02',periods=10,freq='D')
start

DatetimeIndex(['2021-11-02', '2021-11-03', '2021-11-04', '2021-11-05',
               '2021-11-06', '2021-11-07', '2021-11-08', '2021-11-09',
               '2021-11-10', '2021-11-11'],
              dtype='datetime64[ns]', freq='D')

In [292]:
end = pd.date_range('2022-01-31',periods=10,freq='D')
end

DatetimeIndex(['2022-01-31', '2022-02-01', '2022-02-02', '2022-02-03',
               '2022-02-04', '2022-02-05', '2022-02-06', '2022-02-07',
               '2022-02-08', '2022-02-09'],
              dtype='datetime64[ns]', freq='D')

In [293]:
df = pd.DataFrame()
df['start']=start
df['end']=end
df['Duration'] = df['end'] - df['start']

In [294]:
df['start_day'] = df['start'].dt.day

In [295]:
df

,start,end,Duration,start_day
0,2021-11-02,2022-01-31,90 days,2
1,2021-11-03,2022-02-01,90 days,3
2,2021-11-04,2022-02-02,90 days,4
3,2021-11-05,2022-02-03,90 days,5
4,2021-11-06,2022-02-04,90 days,6
5,2021-11-07,2022-02-05,90 days,7
6,2021-11-08,2022-02-06,90 days,8
7,2021-11-09,2022-02-07,90 days,9
8,2021-11-10,2022-02-08,90 days,10
9,2021-11-11,2022-02-09,90 days,11
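
The `Duration` column above holds timedeltas. When a plain number of days is more convenient, the `.dt` accessor can extract it, as in this sketch on a shortened, made-up version of the same frame named `spans`.

```python
import pandas as pd

spans = pd.DataFrame({
    "start": pd.date_range("2021-11-02", periods=3, freq="D"),
    "end": pd.date_range("2022-01-31", periods=3, freq="D"),
})

spans["Duration"] = spans["end"] - spans["start"]      # timedelta column
spans["duration_days"] = spans["Duration"].dt.days     # whole days as integers
spans["start_weekday"] = spans["start"].dt.day_name()  # weekday name of each start date

print(spans)
```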


In [296]:
import calendar
print(calendar.month(2022,1))

    January 2022
Mo Tu We Th Fr Sa Su
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31

