![1_93CVLqnQESmvfOhzvYUgQw.png](attachment:1_93CVLqnQESmvfOhzvYUgQw.png)

- Pandas is an open-source, BSD-licensed Python library.
- It provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.


# Key Features of Pandas:

   - Fast and efficient DataFrame object with default and customized indexing.
   - Tools for loading data into in-memory data objects from different file formats.
   - Data alignment and integrated handling of missing data.
   - Reshaping and pivoting of data sets.
   - Label-based slicing, indexing and subsetting of large data sets.
   - Columns from a data structure can be deleted or inserted.
   - Group by data for aggregation and transformations.
   - High performance merging and joining of data.
   - Time Series functionality.
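
Two of the features listed above, grouping for aggregation and merging, can be sketched in a few lines. The `sales` and `managers` tables below are hypothetical data invented for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "units": [10, 5, 7, 3],
})
managers = pd.DataFrame({
    "region": ["East", "West"],
    "manager": ["Ana", "Raj"],
})

# Group by a column and aggregate each group.
units_per_region = sales.groupby("region")["units"].sum()

# Merge (join) two DataFrames on a shared key column.
merged = sales.merge(managers, on="region", how="left")

print(units_per_region)
print(merged)
```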

# Installation using pip





![009_PIP.jpg](attachment:009_PIP.jpg)




# Installation using Anaconda

![conda%20panda.png](attachment:conda%20panda.png)



- Once the installation is complete, open your IDE (Jupyter, PyCharm, etc.) and import the library by typing `import pandas as pd`.
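
A quick way to confirm that the installation worked is to import the library and print its version. This is only a minimal check; the version printed depends on your environment.

```python
import pandas as pd

# If the import succeeds, pandas is installed; print the version for reference.
print(pd.__version__)
```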

# Python Pandas Operations

![Operations-0-1.png](attachment:Operations-0-1.png)



# Dropping Rows And Columns In pandas Dataframe


In [21]:
import pandas as pd

In [20]:
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}

df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

Unnamed: 0,name,year,reports
Cochice,Jason,2012,4
Pima,Molly,2012,24
Santa Cruz,Tina,2013,31
Maricopa,Jake,2014,2
Yuma,Amy,2014,3


In [22]:
#Drop an observation (row)

df.drop(['Cochice', 'Pima'])

Unnamed: 0,name,year,reports
Santa Cruz,Tina,2013,31
Maricopa,Jake,2014,2
Yuma,Amy,2014,3


In [23]:
#Drop a variable (column)
#Note: axis=1 denotes that we are referring to a column, not a row

df.drop('reports', axis=1)

Unnamed: 0,name,year
Cochice,Jason,2012
Pima,Molly,2012
Santa Cruz,Tina,2013
Maricopa,Jake,2014
Yuma,Amy,2014


In [26]:
# Drop a row if it contains a certain value (in this case, 'Tina')
#Specifically: select all rows where the value in the name column does not equal 'Tina' (this returns a new DataFrame; df itself is unchanged)

df[df.name != 'Tina']

Unnamed: 0,name,year,reports
Cochice,Jason,2012,4
Pima,Molly,2012,24
Maricopa,Jake,2014,2
Yuma,Amy,2014,3


In [27]:
#Drop a row by its position (in this case, the third row)
#Note that pandas uses zero-based numbering, so 0 is the first row, 1 is the second row, etc.

df.drop(df.index[2])

Unnamed: 0,name,year,reports
Cochice,Jason,2012,4
Pima,Molly,2012,24
Maricopa,Jake,2014,2
Yuma,Amy,2014,3


In [28]:
# This can be extended to drop several rows by position (here the third and fourth rows)

df.drop(df.index[[2,3]])

Unnamed: 0,name,year,reports
Cochice,Jason,2012,4
Pima,Molly,2012,24
Yuma,Amy,2014,3


In [37]:
# Drop only the second-to-last row

df.drop(df.index[-2])


Unnamed: 0,name,year,reports
Cochice,Jason,2012,4
Pima,Molly,2012,24
Santa Cruz,Tina,2013,31
Yuma,Amy,2014,3


In [39]:
df[:-2] #drop bottom 2 rows 

Unnamed: 0,name,year,reports
Cochice,Jason,2012,4
Pima,Molly,2012,24
Santa Cruz,Tina,2013,31
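
The variations below are a minimal sketch of a few other ways to call `drop()` on the same data. The label `'Tucson'` is a hypothetical label that does not exist in the index; it is used only to show `errors='ignore'`.

```python
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# drop() returns a new DataFrame; df itself keeps all five rows.
without_pima = df.drop(index='Pima')

# columns= is an alternative to passing axis=1.
without_reports = df.drop(columns='reports')

# errors='ignore' skips labels that are not present instead of raising KeyError.
# 'Tucson' is a hypothetical label that is not in the index.
unchanged = df.drop(index='Tucson', errors='ignore')

# Positional slicing with iloc: keep everything except the last two rows.
top_three = df.iloc[:-2]

print(without_pima.shape, without_reports.shape, unchanged.shape, top_three.shape)
```

Because none of these calls modify `df`, assign the result to a name (or back to `df`) if you want to keep it.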


# Change the index

    - The pandas set_index() method sets the index of a DataFrame from one or more existing columns, or from an array (such as a list, Series or Index) of the same length as the DataFrame.


Parameters:

![set_index.png](attachment:set_index.png)

In [88]:
# import dataset (forward slashes throughout avoid the invalid "\i" escape sequence)
data=pd.read_csv("C:/Users/i life/Downloads/employees.csv")

In [89]:
data.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [92]:
# setting first name as index column 

data.set_index("First Name", inplace = True) 
data.head()

First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [44]:
df= pd.DataFrame({"Day":[1,2,3,4], "Visitors":[200, 100,230,300], "Bounce_Rate":[20,45,60,10]}) 
df



Unnamed: 0,Day,Visitors,Bounce_Rate
0,1,200,20
1,2,100,45
2,3,230,60
3,4,300,10


In [45]:
#set index as day column

df.set_index("Day", inplace= True)
 
df

Day,Visitors,Bounce_Rate
1,200,20
2,100,45
3,230,60
4,300,10
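
The sketch below shows the same operation without `inplace=True`, plus two related options: keeping the column with `drop=False` and undoing the change with `reset_index()`. It reuses the visitors data from the cell above.

```python
import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4],
                   "Visitors": [200, 100, 230, 300],
                   "Bounce_Rate": [20, 45, 60, 10]})

# Without inplace=True, set_index returns a new DataFrame and leaves df alone.
by_day = df.set_index("Day")

# drop=False keeps "Day" as a regular column as well as using it for the index.
by_day_kept = df.set_index("Day", drop=False)

# reset_index moves the index back into a column and restores the default 0..n-1 index.
restored = by_day.reset_index()

# Label-based lookup now uses the Day values.
print(by_day.loc[3, "Visitors"])
print(restored)
```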


# Change the column header
    - The pandas rename() method renames index (row) labels or column labels. Columns can also be renamed by assigning a full list of names to dataframe.columns.

In [46]:

#import dataset (forward slashes throughout avoid the invalid "\i" escape sequence)

data=pd.read_csv("C:/Users/i life/Downloads/employees.csv")
data.head()



Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [47]:
# changing columns with rename()
new_data = data.rename(columns = {"First Name": "Name", 
                                  "Team":"Team name",}) 
new_data.head()

Unnamed: 0,Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team name
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [111]:
# changing columns by assigning to the .columns attribute
# Even if only one column is renamed, the full list of column names has to be passed.

data.columns = ['Name', 'Gender', 'date', 
                'last login time', 'salary', 'Bonus', 'senior management','team'] 
data.head()

Unnamed: 0,Name,Gender,date,last login time,salary,Bonus,senior management,team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
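
A few more `rename()` patterns, sketched on a small stand-in for the employees data (two rows and three columns copied from the `head()` output above, since the CSV file itself is not available here):

```python
import pandas as pd

# Small stand-in for the employees data; values copied from the head() output.
data = pd.DataFrame({"First Name": ["Douglas", "Thomas"],
                     "Salary": [97308, 61933],
                     "Bonus %": [6.945, 4.170]})

# Rename a single column; the other columns are untouched.
renamed = data.rename(columns={"First Name": "Name"})

# Pass a function to transform every column label.
lowered = data.rename(columns=str.lower)

# Rename row labels through index=.
relabelled = data.rename(index={0: "first", 1: "second"})

print(list(renamed.columns))
print(list(lowered.columns))
print(list(relabelled.index))
```

Unlike assigning to `.columns`, `rename()` only needs the labels that change, and it returns a new DataFrame unless the result is assigned back.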


# Python Pandas - Missing Data

   - Missing data is a common problem in real-world data sets.
   In areas like machine learning and data mining, missing values lower the quality of the data and can severely reduce the accuracy of model predictions.
   Treating missing values is therefore a major step toward making models more accurate and valid.

In [17]:
import pandas as pd
import numpy as np


In [18]:
#Create dataframe with missing values
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
            'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'], 
            'age': [42, np.nan, 36, 24, 73], 
            'sex': ['m', np.nan, 'f', 'm', 'f'], 
            'preTestScore': [4, np.nan, np.nan, 2, 3],
            'postTestScore': [25, np.nan, np.nan, 62, 70]}

df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


- Check for Missing Values

  To make detecting missing values easier, pandas provides the isnull() and notnull() functions.

In [156]:
print(df.isnull())

   first_name  last_name    age    sex  preTestScore  postTestScore
0       False      False  False  False         False          False
1        True       True   True   True          True           True
2       False      False  False  False          True           True
3       False      False  False  False         False          False
4       False      False  False  False         False          False


In [157]:
print(df.notnull())

   first_name  last_name    age    sex  preTestScore  postTestScore
0        True       True   True   True          True           True
1       False      False  False  False         False          False
2        True       True   True   True         False          False
3        True       True   True   True          True           True
4        True       True   True   True          True           True
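
The boolean tables above can be reduced to one flag per row with `any()` and `all()`. This is a minimal sketch on a three-row subset of the data, cut down for brevity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"first_name": ["Jason", np.nan, "Tina"],
                   "age": [42, np.nan, 36],
                   "preTestScore": [4, np.nan, np.nan]})

# True for rows with at least one missing value.
rows_with_any_missing = df.isnull().any(axis=1)

# True for rows where every value is missing.
rows_all_missing = df.isnull().all(axis=1)

# Use the flags as a boolean mask to inspect the affected rows.
print(df[rows_with_any_missing])
print(rows_all_missing.tolist())
```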


- Calculations with Missing Data

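 - When summing data, NA values are skipped, which has the same effect as treating them as zero.
 
 - If the data are all NA, the sum is 0 by default in recent pandas versions; pass min_count=1 to get NA instead.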

In [168]:
df['age'].sum() #null values are skipped, which has the same effect as treating them as zero

175.0

In [161]:
df.isna().sum()

first_name       1
last_name        1
age              1
sex              1
preTestScore     2
postTestScore    2
dtype: int64
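
How `sum()` treats missing values can be checked directly. The sketch below assumes a recent pandas version, where the sum of an all-NA Series is 0 unless `min_count` is given; the `all_missing` Series is a hypothetical example.

```python
import numpy as np
import pandas as pd

ages = pd.Series([42, np.nan, 36, 24, 73])
all_missing = pd.Series([np.nan, np.nan])  # hypothetical all-NA data

print(ages.sum())                    # NaN is skipped
print(ages.sum(skipna=False))        # NaN propagates into the result
print(all_missing.sum())             # all-NA data sums to 0 by default
print(all_missing.sum(min_count=1))  # require at least one valid value, else NaN
print(ages.mean())                   # the mean uses only the non-missing values
```

Note that `mean()` divides by the number of non-missing values, so skipping NA is not the same as replacing it with zero for every statistic.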

In [169]:
# Handling missing values
# 1.Drop missing observations

df_no_missing = df.dropna()
df_no_missing


Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [171]:
#Create a new column full of missing values

df['location'] = np.nan
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,,,,
2,Tina,Ali,36.0,f,,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


In [19]:

#Drop rows that contain only missing values

df.dropna(axis=0,how='all')


Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [172]:
#Drop columns that contain only missing values

df.dropna(axis=1, how='all')

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [181]:
#Drop rows that contain fewer than five non-missing values
#This is mostly useful for time series

df.dropna(thresh=5)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,
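
To see what `thresh` compares against, count the non-missing values in each row. The sketch below uses a reduced four-row, three-column version of the data and a threshold of 2 rather than 5.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"first_name": ["Jason", np.nan, "Tina", "Jake"],
                   "age": [42, np.nan, 36, 24],
                   "preTestScore": [4, np.nan, np.nan, 2]})

# Number of non-missing values in each row.
non_missing_per_row = df.notnull().sum(axis=1)

# thresh=2 keeps rows that have at least two non-missing values.
kept = df.dropna(thresh=2)

# The same selection written as an explicit boolean mask.
same_rows = df[non_missing_per_row >= 2]

print(non_missing_per_row.tolist())
print(kept.index.tolist())
```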


In [182]:
#Fill in missing data with zeros
df.fillna(0)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,0.0
1,0,0,0.0,0,0.0,0.0,0.0
2,Tina,Ali,36.0,f,0.0,0.0,0.0
3,Jake,Milner,24.0,m,2.0,62.0,0.0
4,Amy,Cooze,73.0,f,3.0,70.0,0.0


In [183]:
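#Fill in missing values with the column mean
#Assigning the result back to the column saves the change in df
#(calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions)

df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())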
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,,3.0,,
2,Tina,Ali,36.0,f,3.0,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,
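
Different columns often need different fill values. As a sketch, `fillna()` also accepts a dict that maps column names to fill values; the placeholder string `"unknown"` is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sex": ["m", np.nan, "f"],
                   "preTestScore": [4.0, np.nan, 2.0],
                   "postTestScore": [25.0, np.nan, 62.0]})

# Each key is a column name and each value is that column's fill value.
# Columns that are not listed (postTestScore here) keep their missing values.
filled = df.fillna({"sex": "unknown",  # arbitrary placeholder
                    "preTestScore": df["preTestScore"].mean()})

print(filled)
```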


In [187]:
#Select some rows but ignore the missing data points
# Select the rows of df where age is not NaN and sex is not NaN

df[df['age'].notnull() & df['sex'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
2,Tina,Ali,36.0,f,3.0,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,
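
The boolean mask above can also be written with `dropna(subset=...)`, which drops a row only when one of the listed columns is missing. This is a minimal sketch on a reduced version of the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"first_name": ["Jason", np.nan, "Tina", "Jake"],
                   "age": [42, np.nan, 36, 24],
                   "sex": ["m", np.nan, np.nan, "m"]})

# Explicit mask: keep rows where both age and sex are present.
mask_version = df[df["age"].notnull() & df["sex"].notnull()]

# Equivalent selection: only the columns in subset are checked for missing values.
subset_version = df.dropna(subset=["age", "sex"])

print(mask_version.index.tolist())
print(subset_version.index.tolist())
```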


In [49]:
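# Read a zip-compressed CSV; the archive must exist in the working directory, otherwise pandas raises the error shown below
df = pd.read_csv('Data_preprocessing.zip', compression='zip')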

FileNotFoundError: [Errno 2] No such file or directory: 'Data_preprocessing.zip'