# 1. Basics of Pandas for data manipulation

###### A. Series and DataFrames
Series and DataFrames are the two core Pandas data structures.

A Series is like a one-dimensional NumPy array with axis labels (an index).

A DataFrame is a two-dimensional, table-like structure with labels on both rows and columns. Each column is a Series.

NumPy arrays are mostly used for homogeneous numeric data. Pandas, on the other hand, supports a wide range of data types in the same table, from numbers to strings and dates.

In [None]:
# importing numpy and pandas

import numpy as np
import pandas as pd

###### Creating Series
A Series can be created from a Python list, a dictionary, or a NumPy array.
If no index is given, a default integer index starting from 0 is used.
If an index is provided, those labels are used instead.

In [4]:
# Creating the series from a Python list

num_list = [1,2,3,4,5]

pd.Series(num_list)

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [5]:
week_days = ['Mon','Tues','Wed','Thur','Fri']

pd.Series(week_days, index=["a", "b", "c", "d", "e"])

a     Mon
b    Tues
c     Wed
d    Thur
e     Fri
dtype: object

In [6]:
# Creating the Series from dictionary 

countries_code = { 1:"United States",
                 91:"India",
                 49:"Germany",
                 86:"China",
                250:"Rwanda"}

pd.Series(countries_code)

1      United States
91             India
49           Germany
86             China
250           Rwanda
dtype: object

In [7]:
# Creating the Series from a NumPy array
# If we don't provide an index, the default index is 0, 1, 2, ...
# The next cell passes an explicit list of index labels

arr = np.array ([1, 2, 3, 4, 5])
pd.Series(arr)


0    1
1    2
2    3
3    4
4    5
dtype: int32

In [8]:
pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])

a    1
b    2
c    3
d    4
e    5
dtype: int32
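
Once a Series has an index, its values can be read either by label or by position. The following is a minimal sketch of that idea; the variable name `s` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the cell above, with explicit index labels
s = pd.Series(np.array([1, 2, 3, 4, 5]), index=['a', 'b', 'c', 'd', 'e'])

print(s['b'])              # access by label      -> 2
print(s.iloc[1])           # access by position   -> 2
print(s.index.tolist())    # ['a', 'b', 'c', 'd', 'e']
print(s.to_numpy().tolist())  # the underlying values: [1, 2, 3, 4, 5]
```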

###### Creating DataFrames

The DataFrame is the most used Pandas data structure. It can be created from a dictionary, a 2D array, or a Series.

In [9]:
# Creating a DataFrame from a dictionary
# Keys become the column names and values become the column values.
# Note: each column is stored as a Series, so df['Name'] returns a Series
countries = {'Name': ['USA', 'India', 'German', 'Rwanda'], 
             
             'Codes':[1, 91, 49, 250] }

pd.DataFrame(countries)

Unnamed: 0,Name,Codes
0,USA,1
1,India,91
2,German,49
3,Rwanda,250
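
The note about columns being Series can be checked directly. This is a small sketch under the assumption that the DataFrame is stored in a variable (here called `countries_df`, a name not used in the notebook).

```python
import pandas as pd

countries_df = pd.DataFrame({'Name': ['USA', 'India', 'German', 'Rwanda'],
                             'Codes': [1, 91, 49, 250]})

name_col = countries_df['Name']

# Selecting one column gives back a Series that shares the DataFrame's index
print(type(name_col).__name__)        # Series
print(name_col.index.tolist())        # [0, 1, 2, 3]

# Each column keeps its own data type
print(pd.api.types.is_integer_dtype(countries_df['Codes']))   # True
```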


In [10]:
# Creating a dataframe from a 2D array
# You pass the list of columns

array_2d = np.array ([[1,2,3], [4,5,6], [7,8,9]])

pd.DataFrame(array_2d, columns = ['column 1', 'column 2', 'column 3'])

Unnamed: 0,column 1,column 2,column 3
0,1,2,3
1,4,5,6
2,7,8,9


In [14]:
# Creating a dataframe from Pandas series 
# Pass the columns in a list

countries_code = { "United States": 1,
                 "India": 91,
                 "Germany": 49,
                 "China": 86,
                 "Rwanda":250}

pd_series = pd.Series(countries_code)
df = pd.DataFrame(pd_series, columns = ['Codes'])
df

Unnamed: 0,Codes
United States,1
India,91
Germany,49
China,86
Rwanda,250


In [29]:
# Adding a column
# The population numbers are made up

df ['Population'] = [100, 450, 575, 5885, 533]



In [30]:
df.drop('Population', axis =1)

Unnamed: 0,Codes
United States,1
India,91
Germany,49
China,86
Rwanda,250


In [31]:
df.columns

Index(['Codes', 'Population'], dtype='object')

Note: why does the Population column still show up after dropping it?
drop() returns a new DataFrame and leaves the original unchanged. To keep the result, assign it back to the original variable:

df = df.drop('Population', axis=1)
df.columns
Output:
Index(['Codes'], dtype='object')
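
The behaviour described in the note can be reproduced in isolation. This is a minimal sketch with made-up data; `df_demo` and `dropped` are illustrative names, and `drop(columns=...)` is used as an equivalent spelling of `drop(..., axis=1)`.

```python
import pandas as pd

# Small made-up table mirroring the example above
df_demo = pd.DataFrame({'Codes': [1, 91, 49], 'Population': [100, 450, 575]},
                       index=['United States', 'India', 'Germany'])

# drop() returns a new DataFrame; df_demo itself is unchanged
dropped = df_demo.drop(columns='Population')
print(list(dropped.columns))    # ['Codes']
print(list(df_demo.columns))    # ['Codes', 'Population']

# Reassign to keep the result
df_demo = df_demo.drop(columns='Population')
print(list(df_demo.columns))    # ['Codes']
```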

In [18]:
# keys() is a method, so it needs parentheses; for a DataFrame it returns the column labels
df.keys()

Index(['Codes', 'Population'], dtype='object')

In [19]:
df.index

Index(['United States', 'India', 'Germany', 'China', 'Rwanda'], dtype='object')

###### B. Data Indexing and Selection

Indexing and selection work on both Series and DataFrames.

Because a DataFrame is made of Series, let's focus on how to select data in a DataFrame.

In [33]:
# Creating a DataFrame from a dictionary
# We can also pass custom index labels; if we don't, the default integer index starting from 0 is used

countries = {'Name': ['USA', 'India', 'German', 'Rwanda'], 
             
             'Codes':[1, 91, 49, 250] }

df = pd.DataFrame(countries, index=['a', 'b', 'c', 'd'])
df

Unnamed: 0,Name,Codes
a,USA,1
b,India,91
c,German,49
d,Rwanda,250


In [34]:
# Selecting a single column
df['Name']

a       USA
b     India
c    German
d    Rwanda
Name: Name, dtype: object

In [35]:
# Selecting more than one column: pass the column names in a list
df [['Name', 'Codes']]

Unnamed: 0,Name,Codes
a,USA,1
b,India,91
c,German,49
d,Rwanda,250


###### Row selection
1) loc: selects rows by index label, i.e. by the exact index names. Label slices include the end label.
2) iloc: selects rows by integer position, starting from 0. Position slices exclude the end position.


In [36]:
#without using loc and iloc 
# This will return the first two rows
df [0:2]

Unnamed: 0,Name,Codes
a,USA,1
b,India,91


In [41]:
#selecting rows from index a to c by passing index names
#df.loc['a'] # selecting only one row
df.loc['a':'c']

Unnamed: 0,Name,Codes
a,USA,1
b,India,91
c,German,49


In [42]:
df [:'b']

Unnamed: 0,Name,Codes
a,USA,1
b,India,91


In [43]:
# Positions start at 0, so this selects the rows at positions 1 and 2
df.iloc[1:3]

Unnamed: 0,Name,Codes
b,India,91
c,German,49


In [44]:
df.iloc[2:] # from position 2 to the end

Unnamed: 0,Name,Codes
c,German,49
d,Rwanda,250


In [45]:
df.iloc[:2] # from position 0 up to, but not including, position 2

Unnamed: 0,Name,Codes
a,USA,1
b,India,91
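
The difference between label-based and position-based slicing is easy to miss, so here is a short side-by-side sketch. It rebuilds the same table under the illustrative name `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['USA', 'India', 'German', 'Rwanda'],
                        'Codes': [1, 91, 49, 250]},
                       index=['a', 'b', 'c', 'd'])

by_label = df_demo.loc['a':'c']     # label slice: the end label 'c' is included
by_position = df_demo.iloc[0:2]     # position slice: the end position 2 is excluded

print(list(by_label.index))       # ['a', 'b', 'c']
print(list(by_position.index))    # ['a', 'b']

# A row and a column can be selected together
print(df_demo.loc['b', 'Name'])   # India
print(df_demo.iloc[1, 0])         # India
```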


###### Conditional Selection

In [46]:
df

Unnamed: 0,Name,Codes
a,USA,1
b,India,91
c,German,49
d,Rwanda,250


In [48]:
#Let's select a country with code 49
df[df['Codes']==49]

Unnamed: 0,Name,Codes
c,German,49


In [49]:
df [df['Codes'] < 250 ]

Unnamed: 0,Name,Codes
a,USA,1
b,India,91
c,German,49


In [50]:
df [df['Name'] =='USA' ]

Unnamed: 0,Name,Codes
a,USA,1


In [57]:
# Use & (and) and | (or) to combine more than one condition; wrap each condition in parentheses
#df [(condition 1) & (condition 2)]

#df [(df['Codes'] == 91 ) & (df['Name'] == 'India') ]
#df[(df['Codes']==91)| (df['Name']=='India')]
df[(df['Codes']==91) & (df['Name']=='India')]

Unnamed: 0,Name,Codes
b,India,91
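
Each condition produces a boolean Series (a mask), and the masks can be stored and combined before filtering. The sketch below illustrates this with `&`, `|`, and `~` (negation); the mask names are illustrative only.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['USA', 'India', 'German', 'Rwanda'],
                        'Codes': [1, 91, 49, 250]},
                       index=['a', 'b', 'c', 'd'])

# Each comparison yields a boolean Series aligned with the rows
mask_small = df_demo['Codes'] < 100
mask_india = df_demo['Name'] == 'India'

either = df_demo[mask_small | mask_india]       # at least one condition holds
both = df_demo[mask_small & mask_india]         # both conditions hold
neither = df_demo[~(mask_small | mask_india)]   # no condition holds

print(list(either.index))     # ['a', 'b', 'c']
print(list(both.index))       # ['b']
print(list(neither.index))    # ['d']
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `==` and `<` in Python.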


In [58]:
# isin() returns True or False for each element, depending on whether it appears in the provided values
sample_codes_names=[1,3,250, 'USA', 'India', 'England']

df.isin(sample_codes_names)

Unnamed: 0,Name,Codes
a,True,True
b,True,False
c,False,False
d,False,True
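
A common use of `isin()` is on a single column, to keep only the rows whose value is in a list. This is a small sketch of that pattern; `wanted` and `selected` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['USA', 'India', 'German', 'Rwanda'],
                        'Codes': [1, 91, 49, 250]},
                       index=['a', 'b', 'c', 'd'])

wanted = ['USA', 'Rwanda', 'England']

# isin() on a column returns one boolean per row
mask = df_demo['Name'].isin(wanted)
print(mask.tolist())                 # [True, False, False, True]

# The mask can then be used for row selection
selected = df_demo[mask]
print(selected['Codes'].tolist())    # [1, 250]
```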


###### DataFrame iteration:
DataFrame.items(): iterate over (column name, Series) pairs.
DataFrame.iteritems(): older alias of items(); deprecated in pandas 1.5 and removed in pandas 2.0.
DataFrame.iterrows(): iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples(index=True, name='Pandas'): iterate over DataFrame rows as namedtuples.

In [59]:
df2 = pd.DataFrame(np.array ([[1,2,3], [4,5,6], [7,8,9]]), 
                   columns = ['column 1', 'column 2', 'column 3'])

df2

Unnamed: 0,column 1,column 2,column 3
0,1,2,3
1,4,5,6
2,7,8,9


In [60]:
# Iterate over (column name, Series) pairs.

for col_name, content in df2.items():
    print(col_name)
    print(content)

column 1
0    1
1    4
2    7
Name: column 1, dtype: int32
column 2
0    2
1    5
2    8
Name: column 2, dtype: int32
column 3
0    3
1    6
2    9
Name: column 3, dtype: int32


In [61]:
# Iterate over DataFrame rows as (index, Series) pairs

for row in df2.iterrows():
    print(row)

(0, column 1    1
column 2    2
column 3    3
Name: 0, dtype: int32)
(1, column 1    4
column 2    5
column 3    6
Name: 1, dtype: int32)
(2, column 1    7
column 2    8
column 3    9
Name: 2, dtype: int32)
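
`itertuples()` is listed above but not demonstrated, so here is a minimal sketch. The column names `c1`, `c2`, `c3` are an assumption made for this example: they are valid Python identifiers, so they can be read as attributes (names containing spaces, such as `column 1`, get renamed to positional names).

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                       columns=['c1', 'c2', 'c3'])

row_sums = []
for row in df_demo.itertuples(index=True, name='Row'):
    # row.Index is the row label; each column is available as an attribute
    row_sums.append(int(row.c1 + row.c2 + row.c3))

print(row_sums)                          # [6, 15, 24]

# The same result without an explicit loop
print(df_demo.sum(axis=1).tolist())      # [6, 15, 24]
```

For simple arithmetic like this, the vectorized form on the last line is usually preferable to looping over rows.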


###### Dealing with Missing data
1) Real-world datasets are messy and often have missing values.
2) Pandas represents missing values as NaN by default. NaN stands for "not a number".
3) Missing values can be ignored, dropped, or filled.

In [63]:
# Creating a DataFrame with missing values
# np.nan is NumPy's IEEE 754 floating-point representation of Not a Number

df3 = pd.DataFrame(np.array ([[1,2,3], [4,np.nan,6], [7,np.nan,np.nan]]), 
                   columns = ['column 1', 'column 2', 'column 3'])
df3

Unnamed: 0,column 1,column 2,column 3
0,1.0,2.0,3.0
1,4.0,,6.0
2,7.0,,


In [64]:
# Recognizing the missing values

df3.isnull()

Unnamed: 0,column 1,column 2,column 3
0,False,False,False
1,False,True,False
2,False,True,True


In [65]:
# Counting the missing values in each column

df3.isnull().sum()

column 1    0
column 2    2
column 3    1
dtype: int64

In [66]:
# Recognizing non-missing values

df3.notna()

Unnamed: 0,column 1,column 2,column 3
0,True,True,True
1,True,False,True
2,True,False,False


In [67]:
df3.notna().sum()

column 1    3
column 2    1
column 3    2
dtype: int64

###### Removing the missing values

In [68]:
# Dropping missing values
# Only the first row remains because dropna() removes every row that has at least one missing value.

df3.dropna()

Unnamed: 0,column 1,column 2,column 3
0,1.0,2.0,3.0


In [69]:
# you can drop NaNs in specific column(s)

df3['column 3'].dropna()

0    3.0
1    6.0
Name: column 3, dtype: float64

In [70]:
# You can drop data by axis
# axis=1 drops every column that contains NaNs
# df3.dropna(axis='columns') is equivalent

df3.dropna(axis=1)


Unnamed: 0,column 1
0,1.0
1,4.0
2,7.0


In [71]:
# axis=0 drops every row that contains NaNs (this is the default)
# df3.dropna(axis='index') is equivalent

df3.dropna(axis=0)

Unnamed: 0,column 1,column 2,column 3
0,1.0,2.0,3.0
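
Dropping every row with any NaN can discard a lot of data. `dropna()` also accepts `subset`, `thresh`, and `how` to make the rule less strict. The sketch below rebuilds the same table under the illustrative name `df_demo`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1, 2, 3], [4, np.nan, 6], [7, np.nan, np.nan]],
                       columns=['column 1', 'column 2', 'column 3'])

# Drop a row only when 'column 3' is missing
by_subset = df_demo.dropna(subset=['column 3'])
print(by_subset.index.tolist())    # [0, 1]

# Keep rows that have at least 2 non-missing values
by_thresh = df_demo.dropna(thresh=2)
print(by_thresh.index.tolist())    # [0, 1]

# Drop a row only when all of its values are missing (none here)
by_all = df_demo.dropna(how='all')
print(by_all.index.tolist())       # [0, 1, 2]
```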


###### Filling missing values:

In [72]:
# Filling Missing values

df3.fillna(10)

Unnamed: 0,column 1,column 2,column 3
0,1.0,2.0,3.0
1,4.0,10.0,6.0
2,7.0,10.0,10.0
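
A constant such as 10 is rarely the best fill value. Two common alternatives, shown here as a sketch rather than a recommendation, are filling with each column's mean and carrying the previous value forward. `df_demo` is an illustrative copy of the table above.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1, 2, 3], [4, np.nan, 6], [7, np.nan, np.nan]],
                       columns=['column 1', 'column 2', 'column 3'])

# Fill each column with its own mean (NaNs are ignored when computing the mean)
filled_mean = df_demo.fillna(df_demo.mean())
print(filled_mean['column 2'].tolist())    # [2.0, 2.0, 2.0]
print(filled_mean['column 3'].tolist())    # [3.0, 6.0, 4.5]

# Forward fill: copy the last valid value downwards
filled_ffill = df_demo.ffill()
print(filled_ffill['column 3'].tolist())   # [3.0, 6.0, 6.0]
```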


###### More Operations

In [73]:
df4 = pd.DataFrame({'Product Name':['Shirt','Boot','Bag'], 
              'Order Number':[45,56,64], 
              'Total Quantity':[10,5,9]}, 
              columns = ['Product Name', 'Order Number', 'Total Quantity'])

In [75]:
#Retrieving basic info about the Dataframe
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Product Name    3 non-null      object
 1   Order Number    3 non-null      int64 
 2   Total Quantity  3 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes


In [76]:
# Return dataframe columns

df4.columns

Index(['Product Name', 'Order Number', 'Total Quantity'], dtype='object')

In [77]:
# Return the size or number of elements in a dataframe

df4.size

9

In [78]:
# Return the shape

df4.shape

(3, 3)

In [79]:
# Return unique values in a given column 

df4['Product Name'].unique()

array(['Shirt', 'Boot', 'Bag'], dtype=object)

In [80]:
# Return a number of unique values
df4['Product Name'].nunique()

3

In [81]:
# Counting the occurrences of each value in a column

df4['Product Name'].value_counts()

Bag      1
Shirt    1
Boot     1
Name: Product Name, dtype: int64

###### Aggregation methods:

In [82]:
df4

Unnamed: 0,Product Name,Order Number,Total Quantity
0,Shirt,45,10
1,Boot,56,5
2,Bag,64,9


In [83]:
df4.describe()

Unnamed: 0,Order Number,Total Quantity
count,3.0,3.0
mean,55.0,8.0
std,9.539392,2.645751
min,45.0,5.0
25%,50.5,7.0
50%,56.0,9.0
75%,60.0,9.5
max,64.0,10.0


In [84]:
# Mode of a column
# The mode is the most frequent value; here every value occurs once, so all of them are returned

df4['Total Quantity'].mode()

0     5
1     9
2    10
dtype: int64
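
The mode is easier to read when some values actually repeat. This short sketch uses made-up Series (`quantities` and `single`) to show both a tie and a single winner.

```python
import pandas as pd

# 10 and 11 both occur twice, so both are returned (sorted)
quantities = pd.Series([10, 5, 9, 11, 11, 8, 14, 23, 10])
print(quantities.mode().tolist())    # [10, 11]

# Only 9 repeats, so it is the single mode
single = pd.Series([5, 9, 9, 10])
print(single.mode().tolist())        # [9]
```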

In [85]:
# The maximum value

df4['Total Quantity'].max()

10

In [86]:
# The minimum value

df4['Total Quantity'].min()

5

In [87]:
# The mean

df4['Total Quantity'].mean()

8.0

In [88]:
# The median value of a column

df4['Total Quantity'].median()

9.0

In [89]:
# Variance 

df4['Total Quantity'].var()

7.0

In [90]:
# Sum of all values in a column

df4['Total Quantity'].sum()

24

###### Group by:
Group by involves splitting the data into groups, applying a function to each group, and combining the results.

In [91]:
df4 = pd.DataFrame({'Product Name':['Shirt','Boot','Bag', 'Ankle', 'Pullover', 'Boot', 'Ankle', 'Tshirt', 'Shirt'], 
              'Order Number':[45,56,64, 34, 67, 56, 34, 89, 45], 
              'Total Quantity':[10,5,9, 11, 11, 8, 14, 23, 10]}, 
              columns = ['Product Name', 'Order Number', 'Total Quantity'])

In [92]:
df4

Unnamed: 0,Product Name,Order Number,Total Quantity
0,Shirt,45,10
1,Boot,56,5
2,Bag,64,9
3,Ankle,34,11
4,Pullover,67,11
5,Boot,56,8
6,Ankle,34,14
7,Tshirt,89,23
8,Shirt,45,10


In [96]:
# Let's group the df by product name

df4.groupby('Product Name').mean()


Product Name,Order Number,Total Quantity
Ankle,34.0,12.5
Bag,64.0,9.0
Boot,56.0,6.5
Pullover,67.0,11.0
Shirt,45.0,10.0
Tshirt,89.0,23.0
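
The split, apply, and combine steps can be made visible by looping over the groups, and several aggregations can be computed at once with `agg()`. This is a sketch using only two of the columns; `df_demo`, `group_sizes`, and `summary` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Product Name': ['Shirt', 'Boot', 'Bag', 'Ankle', 'Pullover',
                     'Boot', 'Ankle', 'Tshirt', 'Shirt'],
    'Total Quantity': [10, 5, 9, 11, 11, 8, 14, 23, 10],
})

# Split: each group is a (key, sub-DataFrame) pair
group_sizes = {name: len(group) for name, group in df_demo.groupby('Product Name')}
print(group_sizes['Boot'])     # 2

# Apply and combine: several aggregations per group in one call
summary = df_demo.groupby('Product Name')['Total Quantity'].agg(['sum', 'mean', 'count'])
print(summary.loc['Boot', 'sum'])      # 13
print(summary.loc['Ankle', 'mean'])    # 12.5
print(summary.loc['Shirt', 'count'])   # 2
```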


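###### Combining datasets
1) concat: stack DataFrames along rows or columns.
2) append: add rows to a DataFrame (deprecated in pandas 1.4 and removed in pandas 2.0; use concat instead).
3) merge: join DataFrames on common columns or indexes, like a database join.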
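
A minimal sketch of `concat` and `merge` follows. The tables (`orders_a`, `orders_b`, `prices`) and their values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

orders_a = pd.DataFrame({'Product Name': ['Shirt', 'Boot'], 'Total Quantity': [10, 5]})
orders_b = pd.DataFrame({'Product Name': ['Bag'], 'Total Quantity': [9]})

# concat: stack the rows of both tables and renumber the index
stacked = pd.concat([orders_a, orders_b], ignore_index=True)
print(stacked.shape)                      # (3, 2)
print(stacked['Product Name'].tolist())   # ['Shirt', 'Boot', 'Bag']

# merge: look up a price for each product; 'Boot' has no match
prices = pd.DataFrame({'Product Name': ['Shirt', 'Bag'], 'Price': [20, 35]})
merged = stacked.merge(prices, on='Product Name', how='left')
print(merged['Price'].isna().tolist())    # [False, True, False]
```

With `how='left'`, every row of `stacked` is kept and unmatched products get NaN in the new column; `how='inner'` would instead keep only the products found in both tables.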