## AIM:
To study the operations of Pandas in Python.

Prathamesh Ingale

Pandas: 
Pandas is a popular open-source library in Python that is widely used for data manipulation, analysis, and visualization. It provides a powerful set of tools for working with structured data, including DataFrames and Series, which make it easier to analyze and visualize complex datasets. Pandas is built on top of NumPy, another popular Python library for scientific computing, and it is highly optimized for performance.

Pandas offers a comprehensive set of functions for data wrangling, cleaning, and preparation, including missing data handling, reshaping, grouping, and aggregating. It also provides tools for data exploration, such as filtering, sorting, and selecting data based on various criteria.

Why do we use Pandas?
One of the main benefits of Pandas is its ability to handle large datasets efficiently. The library allows for efficient data handling and processing, and its functions are optimized for speed and memory efficiency. Pandas also integrates well with other Python libraries for data analysis and visualization, such as Matplotlib and Seaborn, which makes it a popular choice for data scientists and analysts.

Pandas is particularly useful for working with tabular or structured data, such as data stored in CSV, Excel, or SQL databases. The library provides a simple and intuitive interface for loading data from various sources, such as local files or remote databases. Once the data is loaded into a Pandas data frame, it becomes easy to manipulate and analyze using the library's powerful functions.

Another advantage of Pandas is its flexibility and extensibility. The library provides many built-in functions for data manipulation and analysis, but it also allows for custom functions to be defined and applied to the data. This makes it easy to create custom data analysis pipelines that fit specific requirements and use cases.

Pandas also provides excellent support for time series data analysis, which is essential in many fields, such as finance, economics, and engineering. The library provides specialized functions for time series manipulation, such as resampling, shifting, and rolling windows, which make it easier to analyze and visualize time series data.



In [1]:
import pandas as prathamesh_pd

In [2]:
a = [1,4,5]

myvar = prathamesh_pd.Series(a)

print(myvar)

0    1
1    4
2    5
dtype: int64


## Labels

In Pandas, labels give names to the entries of a Series and to the rows and columns of a DataFrame, allowing for easier manipulation and analysis of the data. For a DataFrame they can be assigned using the syntax "df.columns = \['label1', 'label2', ...\]" for column labels and "df.index = \['label1', 'label2', ...\]" for row labels, where "df" is the DataFrame object and "label1", "label2", etc. are the desired labels. A Series created without explicit labels gets the default integer labels 0, 1, 2, ..., so a value can be accessed by its position number.
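The assignment syntax described above can be sketched on a tiny frame. The frame `df` and its values are hypothetical and only serve as an illustration; `pd` is the conventional alias for pandas.

```python
import pandas as pd

# Hypothetical two-row frame; the values are illustrative only.
df = pd.DataFrame([[19, 120], [18, 140]])

# Assign column labels and row labels after construction.
df.columns = ["Age", "Height"]
df.index = ["first", "second"]

print(df)
print(df.loc["second", "Height"])  # look up one cell by its labels
```

Both assignments require a list whose length matches the number of columns or rows.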

In [3]:
print(myvar[1])

4


## Creating Labels

In [4]:
a = [1,4,5]
myvar = prathamesh_pd.Series(a,index=["a","b","c"])
print(myvar)

a    1
b    4
c    5
dtype: int64


In [5]:
print(myvar["a"])

1


## NumPy Arrays to Series
NumPy arrays, as well as Python lists and dictionaries, can easily be converted into a Pandas Series.

In [6]:
import numpy as np

In [7]:
arr = np.array([10,20,30])
prathamesh_pd.Series(arr)

0    10
1    20
2    30
dtype: int64

In [8]:
labels = ["a","b","c"]
prathamesh_pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int64

## Dictionary to Series

In [9]:
d = {'a':10,'b':20,'c':30}
prathamesh_pd.Series(d)

a    10
b    20
c    30
dtype: int64

## Using an index

In [10]:
ser1 = prathamesh_pd.Series([1,2,3,4],index=['USA','Germany','USSR','Japan'])

In [11]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [12]:
ser2 = prathamesh_pd.Series([1,2,5,4],index= ['USA', 'Germany', 'Italy', 'Japan'])

In [13]:
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [14]:
ser1['Japan']

4

In [15]:
ser1+ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
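The NaN entries above appear because addition aligns the two Series on their labels, and 'Italy' and 'USSR' each exist on only one side. A minimal sketch of one way to avoid the NaN values, assuming a missing label should count as 0, uses the `add` method with `fill_value`:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "USSR", "Japan"])
ser2 = pd.Series([1, 2, 5, 4], index=["USA", "Germany", "Italy", "Japan"])

# Treat a label missing from one side as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

Whether 0 is a sensible substitute depends on the data; for some analyses the NaN is the more honest result.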

## DataFrames

In [16]:
data = {
    "Age":[19,18,22,24,25,26,27,28,29,30],
    "Height" :[120,140,162,186,172,174,177,187,166,155]
}
prathamesh_df = prathamesh_pd.DataFrame(data=data)

In [17]:
prathamesh_df

Unnamed: 0,Age,Height
0,19,120
1,18,140
2,22,162
3,24,186
4,25,172
5,26,174
6,27,177
7,28,187
8,29,166
9,30,155


In [18]:
prathamesh_df = prathamesh_pd.DataFrame(data=data, 
                             index=[("Age"+str(i)) for i in range(1,len(data["Age"])+1)])
prathamesh_df

Unnamed: 0,Age,Height
Age1,19,120
Age2,18,140
Age3,22,162
Age4,24,186
Age5,25,172
Age6,26,174
Age7,27,177
Age8,28,187
Age9,29,166
Age10,30,155


In [19]:
np.random.seed(101)

In [20]:
prathamesh_df  = prathamesh_pd.DataFrame(data=np.random.randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [21]:
prathamesh_df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


## Selection of Indexes in pandas

In Pandas dataframes, we can use the index operator selection to select rows and columns by specifying their labels or integer positions within square brackets. For example, to select a column with label 'W', we can use df\['W'\]. This returns a pandas series object that contains the values of the 'W' column.

We can also select multiple columns by passing a list of labels within square brackets. For example, to select columns 'W' and 'X', we can use df\[\['W', 'X'\]\]. This returns a dataframe that contains only the 'W' and 'X' columns.

To select rows by their integer position, the index operator only accepts a slice within square brackets; a single integer such as df\[0\] is interpreted as a column label rather than a row position. For example, to select the first row, we can use df\[0:1\], and to select the first three rows, we can use df\[0:3\].

It's important to note that when selecting rows using index operator selection with a slice object, the end position is not inclusive. In other words, df\[0:3\] selects rows 0, 1, and 2, but not 3.

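In addition to selecting rows and columns, the index operator can assign values: for example, df\['new'\] = df\['W'\] + df\['X'\] creates or overwrites a whole column. To set a single cell, we use the loc indexer instead. For example, to set the value of the 'age' column for the row labelled 0 to 35, we can use df.loc\[0, 'age'\] = 35.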

loc is a label-based indexer that selects rows and columns based on their labels. It accepts row and column labels as input and returns a new dataframe containing the specified rows and columns. For example, df.loc\[\[0, 1\], \['name', 'age'\]\] returns a new dataframe with rows 0 and 1 and columns 'name' and 'age'. The syntax for loc is df.loc\[row_selector, column_selector\], where row_selector and column_selector can be labels, lists of labels, slices, or boolean arrays.

iloc is an integer-based indexer that selects rows and columns based on their integer positions. It accepts row and column positions as input and returns a new dataframe containing the specified rows and columns. For example, df.iloc\[\[0, 1\], \[0, 1\]\] returns a new dataframe with the first two rows and columns. The syntax for iloc is similar to loc, but uses integer positions instead of labels: df.iloc\[row_selector, column_selector\].

One important difference between loc and iloc is their handling of slicing. loc includes both the start and end labels in the slice, while iloc includes the start position but excludes the end position. For example, df.loc\[0:2, 'name':'age'\] selects rows with labels 0, 1, and 2, and columns with labels 'name', 'age', and everything in between. However, df.iloc\[0:2, 0:2\] selects the first and second rows and columns, but not the third row or column.

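Another difference between loc and iloc is how they treat boolean selectors. loc accepts a boolean Series, which it aligns on the index, whereas iloc accepts only a plain boolean array or list and raises an error when given a boolean Series. Boolean arrays are a powerful way to select rows or columns based on a condition, and can be very useful for data manipulation.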

In summary, loc and iloc are two important indexing methods in Pandas that allow us to select rows and columns in a dataframe based on their labels or positions, respectively. The choice between loc and iloc depends on the type of indexing used in the dataframe and the requirements of the specific data manipulation task.
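The slicing difference between loc and iloc can be sketched as follows. The frame below is hypothetical (integers 0 to 19 with string row labels) and is used only so that labels and positions are easy to tell apart.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
df = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=list("ABCDE"),
    columns=list("WXYZ"),
)

by_label = df.loc["A":"C", "W":"X"]   # label slice: end label 'C' is included
by_position = df.iloc[0:2, 0:2]       # position slice: end position 2 is excluded

mask = df["W"] > 4                    # boolean Series aligned on the row index
filtered = df.loc[mask, ["W", "Z"]]   # rows where W > 4, columns W and Z

print(by_label.shape, by_position.shape)
print(filtered)
```

Here the label slice returns three rows while the position slice returns two, which is the inclusive/exclusive distinction described above.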

In [22]:
prathamesh_df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [23]:
prathamesh_df.W # Attribute (dot) access; not recommended because column names can clash with DataFrame methods

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [24]:
type(prathamesh_df['W'])

pandas.core.series.Series

In [25]:
prathamesh_df['new'] = prathamesh_df['W']+prathamesh_df['X']

Here we add the 'W' and 'X' values of each row and store the sum in a new column named 'new'.

In [26]:
prathamesh_df['new']

A    3.334983
B    0.331800
C   -1.278046
D   -0.570177
E    2.169552
Name: new, dtype: float64

In [27]:
prathamesh_df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.334983
B,0.651118,-0.319318,-0.848077,0.605965,0.3318
C,-2.018168,0.740122,0.528813,-0.589001,-1.278046
D,0.188695,-0.758872,-0.933237,0.955057,-0.570177
E,0.190794,1.978757,2.605967,0.683509,2.169552


## Drop
In Pandas, the drop() method is used to remove rows or columns from a dataframe. The drop() method takes two important parameters - axis and inplace.

The axis parameter specifies whether to drop rows or columns. The value of axis can be set to 0 or 'index' to drop rows, and 1 or 'columns' to drop columns. For example, to drop the 'age' column from a dataframe df, we can use df.drop('age', axis=1).

The inplace parameter is an optional parameter that modifies the dataframe directly instead of returning a new dataframe. If inplace is set to True, the original dataframe is modified and no copy is returned. If inplace is set to False (default value), a new dataframe with the specified rows or columns removed is returned. For example, to drop the 'age' column from a dataframe df and modify it in place, we can use df.drop('age', axis=1, inplace=True).

By default, drop() does not modify the original dataframe. Instead, it returns a new dataframe with the specified rows or columns removed. For example, df.drop(0) returns a copy of df without the row labelled 0. To change the original dataframe, we either set inplace=True or assign the result back, as in df = df.drop(0).
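A short sketch of the behaviour described above, using a hypothetical three-row frame. The `columns=` and `index=` keywords are an alternative to the `axis` argument, and `errors="ignore"` suppresses the exception for a label that does not exist.

```python
import pandas as pd

df = pd.DataFrame({"W": [1, 2, 3], "X": [4, 5, 6], "new": [5, 7, 9]},
                  index=["A", "B", "C"])

without_col = df.drop(columns="new")   # same as df.drop("new", axis=1)
without_row = df.drop(index="C")       # same as df.drop("C", axis=0)

# A label that does not exist normally raises KeyError; errors="ignore" skips it.
unchanged = df.drop(columns="missing", errors="ignore")

print(list(without_col.columns))
print(list(without_row.index))
print(list(df.columns))  # the original still has all three columns
```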

In [28]:
prathamesh_df.drop('new',axis=1)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


drop removes the row or column with the given label. It raises a KeyError if the label is not found.

In [29]:
prathamesh_df # Does not drop as it is not inplace.

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.334983
B,0.651118,-0.319318,-0.848077,0.605965,0.3318
C,-2.018168,0.740122,0.528813,-0.589001,-1.278046
D,0.188695,-0.758872,-0.933237,0.955057,-0.570177
E,0.190794,1.978757,2.605967,0.683509,2.169552


In [30]:
prathamesh_df.drop('new',axis=1, inplace=True)

In [31]:
prathamesh_df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [32]:
prathamesh_df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [33]:
prathamesh_df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [34]:
prathamesh_df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [35]:
prathamesh_df.loc[['A']]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826


In [36]:
prathamesh_df.loc[:,'W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [38]:
prathamesh_df.loc['A':'C','W']

A    2.706850
B    0.651118
C   -2.018168
Name: W, dtype: float64

In [39]:
prathamesh_df.loc['B','Y']

-0.8480769834036315

In [40]:
prathamesh_df.loc[['A','B'],['X','Y']]

Unnamed: 0,X,Y
A,0.628133,0.907969
B,-0.319318,-0.848077


In [41]:
prathamesh_df.iloc[1:3,3]

B    0.605965
C   -0.589001
Name: Z, dtype: float64

In [42]:
prathamesh_df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [43]:
prathamesh_df.iloc[:,2]

A    0.907969
B   -0.848077
C    0.528813
D   -0.933237
E    2.605967
Name: Y, dtype: float64

## Conditional Selection
Conditional selection, also known as Boolean indexing, is a powerful feature of Pandas that allows us to select rows from a dataset based on certain conditions. We filter rows from a DataFrame with a condition that evaluates to True or False for each row. Comparison operators such as <, >, ==, and != compare values, and the logical operators & (and), | (or), and ~ (not) combine comparisons into complex conditions. Because & and | bind more tightly than the comparison operators, each comparison must be wrapped in parentheses. Conditional selection allows us to extract specific subsets of data from a larger dataset for further analysis or processing.
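Combining conditions can be sketched like this. The frame is a hypothetical, rounded stand-in for the random data used in this section.

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [2.7, 0.6, -2.0, 0.2], "Y": [0.9, -0.8, 0.5, -0.9]},
    index=["A", "B", "C", "D"],
)

# Each comparison needs its own parentheses because & and | bind tighter than > and <.
both = df[(df["W"] > 0) & (df["Y"] > 0)]     # W positive AND Y positive
either = df[(df["W"] > 0) | (df["Y"] > 0)]   # W positive OR Y positive
negated = df[~(df["W"] > 0)]                 # rows where W is NOT positive

# Select a column in the same step with loc instead of chaining two [] calls.
y_where_w_positive = df.loc[df["W"] > 0, "Y"]

print(both)
print(y_where_w_positive)
```

The single `loc` call does the row filter and the column pick together, which is usually preferable to chained `[]` indexing when values are later assigned.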

In [44]:
prathamesh_df>0

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [46]:
prathamesh_df[prathamesh_df>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [47]:
prathamesh_df[prathamesh_df['W']>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [48]:
prathamesh_df[prathamesh_df['W']>0]['Y'] 
# Value of Y Where W is greater than 0

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

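## Resetting indexes and split
We can set and modify the index of a dataframe. The set_index function makes an existing column the index, and the reset_index function restores the default integer index 0, 1, 2, ..., moving the old index into a column. Both return a new dataframe unless inplace=True is passed.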
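A minimal sketch of both functions on a hypothetical three-row frame; the `drop=True` option shown here discards the old index instead of keeping it as a column.

```python
import pandas as pd

df = pd.DataFrame({"W": [1.0, 2.0, 3.0]}, index=["A", "B", "C"])

kept = df.reset_index()              # old index becomes a column named 'index'
dropped = df.reset_index(drop=True)  # old index is discarded

df["States"] = "MH OD GJ".split()    # add a column of candidate labels
by_state = df.set_index("States")    # returns a new frame; df keeps its index

print(kept)
print(by_state)
```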

In [49]:
prathamesh_df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [50]:
prathamesh_df.reset_index() 

Unnamed: 0,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509


In [51]:
new_index ='MH OD GJ MP DL'.split()

In [52]:
prathamesh_df['States'] = new_index

In [53]:
prathamesh_df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,MH
B,0.651118,-0.319318,-0.848077,0.605965,OD
C,-2.018168,0.740122,0.528813,-0.589001,GJ
D,0.188695,-0.758872,-0.933237,0.955057,MP
E,0.190794,1.978757,2.605967,0.683509,DL


In [54]:
prathamesh_df.set_index('States')

States,W,X,Y,Z
MH,2.70685,0.628133,0.907969,0.503826
OD,0.651118,-0.319318,-0.848077,0.605965
GJ,-2.018168,0.740122,0.528813,-0.589001
MP,0.188695,-0.758872,-0.933237,0.955057
DL,0.190794,1.978757,2.605967,0.683509


In [55]:
prathamesh_df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,MH
B,0.651118,-0.319318,-0.848077,0.605965,OD
C,-2.018168,0.740122,0.528813,-0.589001,GJ
D,0.188695,-0.758872,-0.933237,0.955057,MP
E,0.190794,1.978757,2.605967,0.683509,DL


In [56]:
prathamesh_df.set_index('States', inplace=True)

In [57]:
prathamesh_df

States,W,X,Y,Z
MH,2.70685,0.628133,0.907969,0.503826
OD,0.651118,-0.319318,-0.848077,0.605965
GJ,-2.018168,0.740122,0.528813,-0.589001
MP,0.188695,-0.758872,-0.933237,0.955057
DL,0.190794,1.978757,2.605967,0.683509


## Handling Missing Data

In [58]:
prathamesh_df = prathamesh_pd.read_csv('./Stores_with_null.csv')

In [59]:
prathamesh_df

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,,66490.0
1,2,,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


The isnull function returns a new dataframe of the same shape, with True for every value that is null and False for every value that is not.

In [60]:
prathamesh_df.isnull()

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,False,False,False,True,False
1,False,True,False,False,False
2,False,False,False,False,True
3,False,False,False,False,False
4,False,False,False,True,False
...,...,...,...,...,...
891,False,False,False,False,False
892,False,False,False,False,False
893,False,False,False,False,False
894,False,False,False,False,False


The notnull function returns the opposite: a new dataframe with True where a value is not null and False where it is null.

In [61]:
prathamesh_df.notnull()

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,True,True,True,False,True
1,True,False,True,True,True
2,True,True,True,True,False
3,True,True,True,True,True
4,True,True,True,False,True
...,...,...,...,...,...
891,True,True,True,True,True
892,True,True,True,True,True
893,True,True,True,True,True
894,True,True,True,True,True


The dropna function drops the rows or columns that contain null values. By default it works along axis 0 and drops every row in which any value is NaN; with axis=1 it drops columns instead. The thresh parameter sets the minimum number of non-null values a row or column needs in order to be kept.
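The main dropna options can be sketched on a small hypothetical frame that borrows three column names from the store dataset. The values are illustrative and are not taken from the CSV file.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Store ID": [1, 2, 3, 4],
    "Store_Area": [1659.0, np.nan, 1340.0, np.nan],
    "Store_Sales": [66490.0, 39820.0, np.nan, np.nan],
})

any_null = df.dropna()                         # drop rows with at least one null
all_null = df.dropna(how="all")                # drop rows only if every value is null
by_subset = df.dropna(subset=["Store_Area"])   # only consider Store_Area
cols = df.dropna(axis=1)                       # drop columns containing any null

print(any_null)
print(list(cols.columns))
```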

In [62]:
prathamesh_df.dropna()

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
3,4,1451.0,1748.0,620.0,53730.0
6,7,1542.0,1858.0,1030.0,72240.0
9,10,1030.0,1235.0,1130.0,44150.0
11,12,1751.0,2098.0,720.0,57620.0
12,13,1746.0,2064.0,1050.0,60470.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


In [63]:
# Extra
prathamesh_df.iloc[0:15,:].dropna(axis=1)

Unnamed: 0,Store ID
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10


In [64]:
prathamesh_df.iloc[11:15,:].dropna(axis=1) # rows 11 to 14 contain no nulls, so no column is dropped

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
11,12,1751.0,2098.0,720.0,57620.0
12,13,1746.0,2064.0,1050.0,60470.0
13,14,1615.0,1931.0,1160.0,59130.0
14,15,1469.0,1756.0,770.0,66360.0


In [65]:
prathamesh_df.dropna(axis=1, inplace=True)

In [66]:
prathamesh_df

Unnamed: 0,Store ID
0,1
1,2
2,3
3,4
4,5
...,...
891,892
892,893
893,894
894,895


In [68]:
prathamesh_df = prathamesh_pd.read_csv('./Stores_with_null.csv')
prathamesh_df

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,,66490.0
1,2,,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


## Threshold param
The thresh param is the minimum number of non-null values that a row or column (depending on the axis) must contain in order not to be dropped.
For example, if 70% of people have not responded to a specific question in a questionnaire, the answers to that question are of little use in statistical analysis, so we should drop that column.
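The questionnaire example can be sketched as follows. The survey data and the 30% cut-off are assumptions made for illustration; `thresh` itself takes an absolute count, so the percentage has to be converted first.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical questionnaire: 10 respondents, q2 was answered by only 2 of them.
survey = pd.DataFrame({
    "q1": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "q2": [np.nan] * 8 + [1, 2],
    "q3": [1, np.nan, 3, np.nan, 5, 1, 2, 3, 4, 5],
})

# Keep a column only if at least 30% of the respondents answered it.
min_answers = math.ceil(len(survey) * 3 / 10)
kept = survey.dropna(axis=1, thresh=min_answers)

print(min_answers)
print(list(kept.columns))
```

Column q2 has two answers, fewer than the three required, so it is the only column removed.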

In [69]:
prathamesh_df.dropna(axis=0,thresh=5).head(10)

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
3,4,1451.0,1748.0,620.0,53730.0
6,7,1542.0,1858.0,1030.0,72240.0
9,10,1030.0,1235.0,1130.0,44150.0
11,12,1751.0,2098.0,720.0,57620.0
12,13,1746.0,2064.0,1050.0,60470.0
13,14,1615.0,1931.0,1160.0,59130.0
14,15,1469.0,1756.0,770.0,66360.0
15,16,1644.0,1950.0,790.0,78870.0
16,17,1578.0,1907.0,1440.0,77250.0
17,18,1703.0,2045.0,670.0,38170.0


In [70]:
prathamesh_df.dropna(axis=0,thresh=1).head(10)

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,,66490.0
1,2,,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,,46620.0
5,6,,,,45260.0
6,7,1542.0,1858.0,1030.0,72240.0
7,8,1261.0,1507.0,1020.0,
8,9,,1321.0,680.0,46310.0
9,10,1030.0,1235.0,1130.0,44150.0


In [71]:
prathamesh_df

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,,66490.0
1,2,,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


## Fillna
The fillna function replaces null values, either with a value we supply or with a neighbouring value. The ways of filling are as follows:
1. value: We pass the value kwarg to the function. It returns a dataframe with all null values replaced with the given value.
2. ffill: We pass the kwarg method='ffill' (alias 'pad'). Each null value is filled with the last valid value above it. Nulls in the first row stay null, as there is nothing before them to fill them with.
3. bfill: The same as ffill but in the opposite direction: each null value is filled with the next valid value below it. Nulls in the last row stay null, as there is nothing after them to fill them with.

fillna has more params. The limit param caps the number of consecutive null values that are filled. The value can also be a statistic computed from the data, such as the mean, median or mode of a column. Recent pandas releases deprecate the method kwarg of fillna in favour of the dedicated ffill() and bfill() functions.
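The filling options above can be compared side by side on one small Series. The values are hypothetical; the sketch uses the dedicated `ffill()` and `bfill()` functions rather than the `method` kwarg.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 210.0, np.nan, np.nan, 620.0, np.nan])

constant = s.fillna(0)           # every null becomes 0
forward = s.ffill()              # copy the previous valid value downward
backward = s.bfill()             # copy the next valid value upward
limited = s.ffill(limit=1)       # fill at most one consecutive null per gap
with_mean = s.fillna(s.mean())   # mean of the non-null values (415.0)

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

The first value stays null under forward fill and the last value stays null under backward fill, because neither has a neighbour on the required side.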

In [72]:
prathamesh_df.fillna(value=-99)

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,-99.0,66490.0
1,2,-99.0,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,-99.0
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,-99.0,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


In [73]:
prathamesh_df.fillna(value="no value")

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,no value,66490.0
1,2,no value,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,no value
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,no value,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


In [74]:
prathamesh_df['Store_Area'].fillna(value=prathamesh_df['Store_Area'].mean())

0      1659.000000
1      1485.928331
2      1340.000000
3      1451.000000
4      1770.000000
          ...     
891    1582.000000
892    1387.000000
893    1200.000000
894    1299.000000
895    1174.000000
Name: Store_Area, Length: 896, dtype: float64

In [75]:
prathamesh_df.ffill() # same result as fillna(method='pad'), which recent pandas deprecates

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,,66490.0
1,2,1659.0,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,39820.0
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,620.0,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


In [76]:
prathamesh_df = prathamesh_pd.read_csv('./Stores_with_null.csv')

In [77]:
prathamesh_df.bfill() # same result as fillna(method='bfill'), which recent pandas deprecates

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,210.0,66490.0
1,2,1340.0,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,53730.0
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,1030.0,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


## Replace
We can use the replace function to replace any given value with another. 

In [78]:
prathamesh_df.replace(to_replace=np.nan, value=-99)

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,-99.0,66490.0
1,2,-99.0,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,-99.0
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,-99.0,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


In [79]:
prathamesh_df.replace(to_replace=np.nan, value=-99, inplace=True)

In [80]:
prathamesh_df

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,-99.0,66490.0
1,2,-99.0,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,-99.0
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,-99.0,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0
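replace also works in the opposite direction and with several values at once. The sketch below uses a hypothetical frame in which -99 is a sentinel for missing data; the replacement values in the dict are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Store_Area": [1659.0, -99.0, 1340.0],
                   "Store_Sales": [66490.0, 39820.0, -99.0]})

# Turn the sentinel -99 back into a real missing value.
restored = df.replace(to_replace=-99, value=np.nan)

# A dict maps several old values to new ones in a single call.
recoded = df.replace({1659.0: 1660.0, 1340.0: 1350.0})

print(restored)
print(recoded)
```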


## Interpolation
Interpolation is the process of estimating missing or undefined values from the known values in a dataset. The interpolate() method can perform linear or polynomial interpolation on a Series or DataFrame. Linear interpolation estimates a missing value from the two known values on either side of it. Polynomial interpolation fits a polynomial function to a set of data points and uses that function to estimate the missing values.

By default, the interpolate() method performs linear interpolation. We can specify another type of interpolation by passing the method parameter, which can take values such as 'linear', 'quadratic', 'cubic', and so on. For example, to perform cubic interpolation, we can set method='cubic'. Methods such as 'quadratic' and 'cubic' are delegated to SciPy, so SciPy must be installed to use them.

In addition to the method parameter, the interpolate() method also has other parameters such as limit and limit_direction. The limit parameter sets the maximum number of consecutive NaN values that can be filled. The limit_direction parameter sets the direction in which the filling should occur, either 'forward', 'backward', or 'both'.
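The effect of limit and limit_direction can be sketched on a short hypothetical Series with a leading null and a two-value gap. Only linear interpolation is used here, so SciPy is not needed.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 210.0, np.nan, np.nan, 620.0, np.nan])

# Forward direction: the leading null has no earlier value and stays null.
forward = s.interpolate(method="linear", limit_direction="forward")

# Both directions: nulls at either end are filled as well.
both = s.interpolate(method="linear", limit_direction="both")

# limit=1: at most one consecutive null is filled in each gap.
limited = s.interpolate(method="linear", limit=1)

print(forward.tolist())
print(both.tolist())
print(limited.tolist())
```

The two values inside the gap are placed at equal steps between 210 and 620, which is what "linear" means here.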

In [81]:
prathamesh_df = prathamesh_pd.read_csv('./Stores_with_null.csv')

In [82]:
prathamesh_df

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,,66490.0
1,2,,1752.0,210.0,39820.0
2,3,1340.0,1609.0,720.0,
3,4,1451.0,1748.0,620.0,53730.0
4,5,1770.0,2111.0,,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.0,66390.0
892,893,1387.0,1663.0,850.0,82080.0
893,894,1200.0,1436.0,1060.0,76440.0
894,895,1299.0,1560.0,770.0,96610.0


In [83]:
prathamesh_df.interpolate(method='linear', limit_direction='forward')

Unnamed: 0,Store ID,Store_Area,Items_Available,Daily_Customer_Count,Store_Sales
0,1,1659.0,1961.0,,66490.0
1,2,1499.5,1752.0,210.000000,39820.0
2,3,1340.0,1609.0,720.000000,46775.0
3,4,1451.0,1748.0,620.000000,53730.0
4,5,1770.0,2111.0,756.666667,46620.0
...,...,...,...,...,...
891,892,1582.0,1910.0,1080.000000,66390.0
892,893,1387.0,1663.0,850.000000,82080.0
893,894,1200.0,1436.0,1060.000000,76440.0
894,895,1299.0,1560.0,770.000000,96610.0


## Pandas Operations

In [None]:
import pandas as prathamesh_pd

In [None]:
prathamesh_groceries = {
    'Company':['Nestle','Nestle','Parle','Colgate','Parle'],
    'Person':['Prathamesh','Pratik','Pranav','Pranali','Prachi'],
    'Sales':[20000,12000,34000,12400,24300]
}

In [None]:
prathamesh_df = prathamesh_pd.DataFrame(data=prathamesh_groceries)
prathamesh_df

Unnamed: 0,Company,Person,Sales
0,Nestle,Prathamesh,20000
1,Nestle,Pratik,12000
2,Parle,Pranav,34000
3,Colgate,Pranali,12400
4,Parle,Prachi,24300


## Group By 

The groupby method groups together the rows that share a value in one or more columns, so that aggregate functions can be called on each group.
It returns a special GroupBy object.
The GroupBy object is a convenient way to perform operations on specific groups of rows in a dataframe. It is similar to GROUP BY in SQL.
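Typical aggregate calls on a GroupBy object can be sketched with the groceries data from this section. The variable names `by_company`, `total_sales` and `summary` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Nestle", "Nestle", "Parle", "Colgate", "Parle"],
    "Person": ["Prathamesh", "Pratik", "Pranav", "Pranali", "Prachi"],
    "Sales": [20000, 12000, 34000, 12400, 24300],
})

by_company = df.groupby("Company")

# One total per company.
total_sales = by_company["Sales"].sum()

# Several aggregates at once; each becomes a column of the result.
summary = by_company["Sales"].agg(["sum", "mean", "count"])

print(total_sales)
print(summary)
```

Selecting the 'Sales' column before aggregating keeps the non-numeric 'Person' column out of the calculation.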

In [None]:
group_by_company = prathamesh_df.groupby('Company')

### First method

The first method returns the first non-null value of each column within each group.

In [None]:
group_by_company.first()

Company,Person,Sales
Colgate,Pranali,12400
Nestle,Prathamesh,20000
Parle,Pranav,34000


## Merging, Joining, Concatenating

In [None]:
prathamesh_data_1 = {
    'Student_id': [32468, 32469, 32470, 32471, 32472],
    'Student_name': ['Prathamesh', 'Pratik', 'Pranav', 'Pranali', 'Prachi'],
    'Physics': [52, 36, 70, 27, 43],
    'Chemistry': [80, 46, 68, 43, 62],
    'Biology': [67, 59, 39, 52, 79],
    'Mathematics': [68, 21, 42, 78, 43]
}
prathamesh_df1 = prathamesh_pd.DataFrame(data=prathamesh_data_1)
prathamesh_df1

Unnamed: 0,Student_id,Student_name,Physics,Chemistry,Biology,Mathematics
0,32468,Prathamesh,52,80,67,68
1,32469,Pratik,36,46,59,21
2,32470,Pranav,70,68,39,42
3,32471,Pranali,27,43,52,78
4,32472,Prachi,43,62,79,43


In [None]:
prathamesh_data_2 = {
  'Student_id': [32473, 32474, 32475, 32476, 32477],
  'Student_name': ['Sachin', 'Saurav', 'Rahul', 'Virat', 'Rohit'],
  'Physics': [30, 55, 71, 21, 57],
  'Chemistry': [71, 52, 80, 28, 42],
  'Biology': [42, 75, 50, 23, 25],
  'Mathematics': [79, 42, 79, 72, 52]
}
prathamesh_df2 = prathamesh_pd.DataFrame(prathamesh_data_2, index = [6, 7, 8, 9, 10])
prathamesh_df2

Unnamed: 0,Student_id,Student_name,Physics,Chemistry,Biology,Mathematics
6,32473,Sachin,30,71,42,79
7,32474,Saurav,55,52,75,42
8,32475,Rahul,71,80,50,79
9,32476,Virat,21,28,23,72
10,32477,Rohit,57,42,25,52


In [None]:
prathamesh_data_3 = {
  'Student_id': [32478, 32479, 32480, 32481, 32482],
  'Student_name': ['Raj', 'Rohan', 'Rahul', 'Ravi', 'Ramesh'],
    'Physics': [74, 47, 62, 69, 47],
    'Chemistry': [53, 46, 23, 68, 29],
    'Biology': [70, 33, 47, 55, 59],
    'Mathematics': [77, 53, 64, 73, 23]
}
prathamesh_df3 = prathamesh_pd.DataFrame(prathamesh_data_3, index = [11, 12, 13, 14, 15])
prathamesh_df3

Unnamed: 0,Student_id,Student_name,Physics,Chemistry,Biology,Mathematics
11,32478,Raj,74,53,70,77
12,32479,Rohan,47,46,33,53
13,32480,Rahul,62,23,47,64
14,32481,Ravi,69,68,55,73
15,32482,Ramesh,47,29,59,23


In [None]:
prathamesh_pd.concat([prathamesh_df1,prathamesh_df2,prathamesh_df3])

Unnamed: 0,Student_id,Student_name,Physics,Chemistry,Biology,Mathematics
0,32468,Prathamesh,52,80,67,68
1,32469,Pratik,36,46,59,21
2,32470,Pranav,70,68,39,42
3,32471,Pranali,27,43,52,78
4,32472,Prachi,43,62,79,43
6,32473,Sachin,30,71,42,79
7,32474,Saurav,55,52,75,42
8,32475,Rahul,71,80,50,79
9,32476,Virat,21,28,23,72
10,32477,Rohit,57,42,25,52


In [None]:
prathamesh_pd.concat([prathamesh_df1,prathamesh_df2,prathamesh_df3],axis=1)

Unnamed: 0,Student_id,Student_name,Physics,Chemistry,Biology,Mathematics,Student_id.1,Student_name.1,Physics.1,Chemistry.1,Biology.1,Mathematics.1,Student_id.2,Student_name.2,Physics.2,Chemistry.2,Biology.2,Mathematics.2
0,32468.0,Prathamesh,52.0,80.0,67.0,68.0,,,,,,,,,,,,
1,32469.0,Pratik,36.0,46.0,59.0,21.0,,,,,,,,,,,,
2,32470.0,Pranav,70.0,68.0,39.0,42.0,,,,,,,,,,,,
3,32471.0,Pranali,27.0,43.0,52.0,78.0,,,,,,,,,,,,
4,32472.0,Prachi,43.0,62.0,79.0,43.0,,,,,,,,,,,,
6,,,,,,,32473.0,Sachin,30.0,71.0,42.0,79.0,,,,,,
7,,,,,,,32474.0,Saurav,55.0,52.0,75.0,42.0,,,,,,
8,,,,,,,32475.0,Rahul,71.0,80.0,50.0,79.0,,,,,,
9,,,,,,,32476.0,Virat,21.0,28.0,23.0,72.0,,,,,,
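10,,,,,,,32477.0,Rohit,57.0,42.0,25.0,52.0,,,,,,
11,,,,,,,,,,,,,32478.0,Raj,74.0,53.0,70.0,77.0
12,,,,,,,,,,,,,32479.0,Rohan,47.0,46.0,33.0,53.0
13,,,,,,,,,,,,,32480.0,Rahul,62.0,23.0,47.0,64.0
14,,,,,,,,,,,,,32481.0,Ravi,69.0,68.0,55.0,73.0
15,,,,,,,,,,,,,32482.0,Ramesh,47.0,29.0,59.0,23.0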
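
The column-wise result above is mostly NaN because `concat` with `axis=1` aligns rows by index label, and the frames share no labels. The sketch below uses two hypothetical one-column frames to show two possible ways of dealing with this: resetting the index so rows pair up by position, or passing `join='inner'` to keep only labels present in every frame.

```python
import pandas as pd

# Hypothetical frames with disjoint index labels, like the student frames
left_part = pd.DataFrame({'Physics': [52, 36]}, index=[0, 1])
right_part = pd.DataFrame({'Biology': [42, 75]}, index=[6, 7])

# Labels never overlap, so every row has NaN in one of the columns
misaligned = pd.concat([left_part, right_part], axis=1)

# Resetting both indexes makes the rows line up by position instead
side_by_side = pd.concat(
    [left_part.reset_index(drop=True), right_part.reset_index(drop=True)],
    axis=1,
)

# join='inner' keeps only labels found in all frames (none in this case)
overlap_only = pd.concat([left_part, right_part], axis=1, join='inner')
```

Pairing by position is only meaningful when the rows really do correspond one to one; otherwise a key-based `merge` is the safer choice.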


In [None]:
# Split the first batch into two frames that share Student_id and Student_name
prathamesh_data_1_1 = {
    'Student_id': [32468, 32469, 32470, 32471, 32472],
    'Student_name': ['Prathamesh', 'Pratik', 'Pranav', 'Pranali', 'Prachi'],
    'Physics': [52, 36, 70, 27, 43],
    'Chemistry': [80, 46, 68, 43, 62],
}
prathamesh_data_1_2 = {
    'Student_id': [32468, 32469, 32470, 32471, 32472],
    'Student_name': ['Prathamesh', 'Pratik', 'Pranav', 'Pranali', 'Prachi'],
    'Biology': [67, 59, 39, 52, 79],
    'Mathematics': [68, 21, 42, 78, 43]
}
prathamesh_left = prathamesh_pd.DataFrame(prathamesh_data_1_1)
prathamesh_right = prathamesh_pd.DataFrame(prathamesh_data_1_2)

In [None]:
prathamesh_left

Unnamed: 0,Student_id,Student_name,Physics,Chemistry
0,32468,Prathamesh,52,80
1,32469,Pratik,36,46
2,32470,Pranav,70,68
3,32471,Pranali,27,43
4,32472,Prachi,43,62


In [None]:
prathamesh_right

Unnamed: 0,Student_id,Student_name,Biology,Mathematics
0,32468,Prathamesh,67,68
1,32469,Pratik,59,21
2,32470,Pranav,39,42
3,32471,Pranali,52,78
4,32472,Prachi,79,43


In [None]:
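# Inner merge on Student_id only. Student_name exists in both frames but is not
# a join key, so pandas keeps both copies and adds the _x / _y suffixes
prathamesh_pd.merge(prathamesh_left, prathamesh_right, how='inner', on='Student_id')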

Unnamed: 0,Student_id,Student_name_x,Physics,Chemistry,Student_name_y,Biology,Mathematics
0,32468,Prathamesh,52,80,Prathamesh,67,68
1,32469,Pratik,36,46,Pratik,59,21
2,32470,Pranav,70,68,Pranav,39,42
3,32471,Pranali,27,43,Pranali,52,78
4,32472,Prachi,43,62,Prachi,79,43
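
The `Student_name_x` and `Student_name_y` columns above appear because `Student_name` is present in both frames but was not listed in `on`. A small sketch with hypothetical two-row frames shows two ways this could be handled: merging on both shared columns, or choosing more descriptive suffixes with the `suffixes` parameter.

```python
import pandas as pd

# Hypothetical two-row frames that share Student_id and Student_name
left = pd.DataFrame({
    'Student_id': [32468, 32469],
    'Student_name': ['Prathamesh', 'Pratik'],
    'Physics': [52, 36],
})
right = pd.DataFrame({
    'Student_id': [32468, 32469],
    'Student_name': ['Prathamesh', 'Pratik'],
    'Biology': [67, 59],
})

# Using both shared columns as keys leaves a single Student_name column
both_keys = pd.merge(left, right, how='inner', on=['Student_id', 'Student_name'])

# Alternatively, keep both copies but label them explicitly
custom = pd.merge(left, right, on='Student_id', suffixes=('_left', '_right'))

print(list(both_keys.columns))
print(list(custom.columns))
```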


In [None]:
# Frames with repeated and only partially overlapping keys,
# used to compare the inner, outer, right, and left merge types
prathamesh_data_2_1 = {
    'Student_id': [32473, 32473, 32473, 32473, 32477],
    'Student_name': ['Sachin', 'Sachin', 'Saurav', 'Saurav', 'Rahul'],
    'Physics': [30, 55, 71, 21, 57],
    'Chemistry': [71, 52, 80, 28, 42]
}
prathamesh_data_2_2 = {
    'Student_id': [32473, 32474, 32474, 32474, 32475],
    'Student_name': ['Sachin', 'Sachin', 'Sachin', 'Sachin', 'Sachin'],
    'Biology': [42, 75, 50, 23, 25],
    'Mathematics': [79, 42, 79, 72, 52]
}
prathamesh_left = prathamesh_pd.DataFrame(prathamesh_data_2_1)
prathamesh_right = prathamesh_pd.DataFrame(prathamesh_data_2_2)

In [None]:
prathamesh_left

Unnamed: 0,Student_id,Student_name,Physics,Chemistry
0,32473,Sachin,30,71
1,32473,Sachin,55,52
2,32473,Saurav,71,80
3,32473,Saurav,21,28
4,32477,Rahul,57,42


In [None]:
prathamesh_right

Unnamed: 0,Student_id,Student_name,Biology,Mathematics
0,32473,Sachin,42,79
1,32474,Sachin,75,42
2,32474,Sachin,50,79
3,32474,Sachin,23,72
4,32475,Sachin,25,52


In [None]:
# Default how='inner': keep only (Student_id, Student_name) pairs found in both frames
prathamesh_pd.merge(prathamesh_left, prathamesh_right, on=['Student_id', 'Student_name'])

Unnamed: 0,Student_id,Student_name,Physics,Chemistry,Biology,Mathematics
0,32473,Sachin,30,71,42,79
1,32473,Sachin,55,52,42,79


In [None]:
# Outer merge: keep every key pair from either frame, filling the gaps with NaN
prathamesh_pd.merge(prathamesh_left, prathamesh_right, how='outer', on=['Student_id', 'Student_name'])

Unnamed: 0,Student_id,Student_name,Physics,Chemistry,Biology,Mathematics
0,32473,Sachin,30.0,71.0,42.0,79.0
1,32473,Sachin,55.0,52.0,42.0,79.0
2,32473,Saurav,71.0,80.0,,
3,32473,Saurav,21.0,28.0,,
4,32477,Rahul,57.0,42.0,,
5,32474,Sachin,,,75.0,42.0
6,32474,Sachin,,,50.0,79.0
7,32474,Sachin,,,23.0,72.0
8,32475,Sachin,,,25.0,52.0


In [None]:
# Right merge: keep every row of the right frame, matched with the left where possible
prathamesh_pd.merge(prathamesh_left, prathamesh_right, how='right', on=['Student_id', 'Student_name'])

Unnamed: 0,Student_id,Student_name,Physics,Chemistry,Biology,Mathematics
0,32473,Sachin,30.0,71.0,42,79
1,32473,Sachin,55.0,52.0,42,79
2,32474,Sachin,,,75,42
3,32474,Sachin,,,50,79
4,32474,Sachin,,,23,72
5,32475,Sachin,,,25,52


In [None]:
# Left merge: keep every row of the left frame, matched with the right where possible
prathamesh_pd.merge(prathamesh_left, prathamesh_right, how='left', on=['Student_id', 'Student_name'])

Unnamed: 0,Student_id,Student_name,Physics,Chemistry,Biology,Mathematics
0,32473,Sachin,30,71,42.0,79.0
1,32473,Sachin,55,52,42.0,79.0
2,32473,Saurav,71,80,,
3,32473,Saurav,21,28,,
4,32477,Rahul,57,42,,
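
When comparing the merge types above, it can help to see which frame each result row came from. The sketch below is a reduced, hypothetical version of the two frames (three rows each) and uses the `indicator=True` option of `merge`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`.

```python
import pandas as pd

# Reduced, hypothetical versions of the left and right frames
left = pd.DataFrame({
    'Student_id': [32473, 32473, 32477],
    'Student_name': ['Sachin', 'Saurav', 'Rahul'],
    'Physics': [30, 71, 57],
})
right = pd.DataFrame({
    'Student_id': [32473, 32474, 32475],
    'Student_name': ['Sachin', 'Sachin', 'Sachin'],
    'Biology': [42, 75, 25],
})

# indicator=True records the origin of every row in a '_merge' column
tagged = pd.merge(
    left, right,
    how='outer',
    on=['Student_id', 'Student_name'],
    indicator=True,
)

# Count how many rows matched in both frames versus only one of them
print(tagged['_merge'].value_counts())
```

Filtering on `_merge == 'left_only'` would give the rows of the left frame that have no partner in the right frame.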


In [None]:
# Same marks as before, but with Student_id moved into the index,
# because DataFrame.join matches rows on index labels
prathamesh_data_3_1 = {
    'Physics': [30, 55, 71, 21, 57],
    'Chemistry': [71, 52, 80, 28, 42]
}
prathamesh_data_3_2 = {
    'Biology': [42, 75, 50, 23, 25],
    'Mathematics': [79, 42, 79, 72, 52]
}
prathamesh_left = prathamesh_pd.DataFrame(prathamesh_data_3_1, index=[32473, 32473, 32473, 32473, 32477])
prathamesh_right = prathamesh_pd.DataFrame(prathamesh_data_3_2, index=[32473, 32474, 32474, 32474, 32475])

In [None]:
prathamesh_left

Unnamed: 0,Physics,Chemistry
32473,30,71
32473,55,52
32473,71,80
32473,21,28
32477,57,42


In [None]:
prathamesh_right

Unnamed: 0,Biology,Mathematics
32473,42,79
32474,75,42
32474,50,79
32474,23,72
32475,25,52


In [None]:
# join defaults to how='left': every left row is kept and matched on the index
prathamesh_left.join(prathamesh_right)

Unnamed: 0,Physics,Chemistry,Biology,Mathematics
32473,30,71,42.0,79.0
32473,55,52,42.0,79.0
32473,71,80,42.0,79.0
32473,21,28,42.0,79.0
32477,57,42,,


In [None]:
# Swapping the frames keeps every right row; 32473 matches four left rows, so it repeats
prathamesh_right.join(prathamesh_left)

Unnamed: 0,Biology,Mathematics,Physics,Chemistry
32473,42,79,30.0,71.0
32473,42,79,55.0,52.0
32473,42,79,71.0,80.0
32473,42,79,21.0,28.0
32474,75,42,,
32474,50,79,,
32474,23,72,,
32475,25,52,,


In [None]:
# Outer join: keep the index labels from both frames
prathamesh_left.join(prathamesh_right, how='outer')

Unnamed: 0,Physics,Chemistry,Biology,Mathematics
32473,30.0,71.0,42.0,79.0
32473,55.0,52.0,42.0,79.0
32473,71.0,80.0,42.0,79.0
32473,21.0,28.0,42.0,79.0
32474,,,75.0,42.0
32474,,,50.0,79.0
32474,,,23.0,72.0
32475,,,25.0,52.0
32477,57.0,42.0,,
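
The `join` calls above match rows on the index rather than on a column. As a rough sketch with hypothetical small frames, the same result can be obtained from `merge` by setting `left_index=True` and `right_index=True`, which shows how the two functions relate:

```python
import pandas as pd

# Hypothetical frames indexed by student id, with a repeated label on the left
left = pd.DataFrame({'Physics': [30, 55, 57]}, index=[32473, 32473, 32477])
right = pd.DataFrame({'Biology': [42, 75]}, index=[32473, 32474])

# DataFrame.join matches on the index and defaults to a left join
joined = left.join(right)

# The equivalent merge call, with both sides keyed on their index
merged = pd.merge(left, right, how='left', left_index=True, right_index=True)
print(joined.equals(merged))

# An outer join keeps the labels from both frames
outer = left.join(right, how='outer')
```

In this sketch `join` acts as a shorthand for an index-based `merge`; `merge` remains the more general tool when the keys live in ordinary columns.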


## Conclusion:
I have learned how to use pandas to merge, join, and concatenate data, and more broadly how to use the pandas library for exploratory data analysis.