<div id="header"><p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:25px;">Pandas - A Complete Guide</p></div>

---

<p style="text-align:right; font-family:verdana;">Follow <a href="https://github.com/TheMrityunjayPathak" style="color:#6a66bd; text-decoration:none;">@Mrityunjay Pathak</a> for more!</p>
    
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Introduction to Pandas</font>
<br>
<br>
<strong>What is Pandas?</strong>
<br>
• Pandas is a Python library used for working with data sets.
<br>
• It has functions for analyzing, cleaning, exploring and manipulating data.
<br>
<br>
<strong>Why use Pandas?</strong>
<br>
• Pandas lets us analyze large data sets and draw conclusions using statistical methods.
<br>
• Pandas can clean messy data sets and make them readable and relevant.
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Installing Pandas</font>
<br>
<br>
• If you already have <a href="https://www.python.org/downloads/" style="text-decoration:none; color:#6a66bd;">Python</a> and <a href="https://pypi.org/project/pip/" style="text-decoration:none; color:#6a66bd;">pip</a> installed on your system, installing Pandas is straightforward.
<br>
• Install it with the command <code>pip install pandas</code>.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> You can run pip commands directly from a Jupyter Notebook cell instead of a terminal.
</div>
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> Just add an exclamation mark before the pip command, for example <code>!pip install pandas</code>.
</div>
</div>

In [1]:
!pip install pandas



<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Getting Started with Pandas</font>
<br>
<br>
<strong>Import Pandas</strong>
<br>
• Once Pandas is installed, import it into your project with the import keyword.
</div>

In [2]:
import pandas

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Pandas as pd</strong>
<br>
• Pandas is usually imported under the pd alias.
<br>
• Create an alias with the as keyword while importing.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> In Python, an alias is an alternate name for referring to the same thing.
</div>
</div>

In [3]:
import pandas as pd

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Checking Pandas Version</strong>
<br>
• The version string is stored in the <code>__version__</code> attribute.
</div>

In [4]:
print(pd.__version__)

2.1.4


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas Series</font>
<br>
<br>
• A Pandas Series is a one-dimensional labeled array-like object that can hold data of any type.
<br>
• A Pandas Series can be thought of as a column in a spreadsheet or a single column of a DataFrame. 
<br>
• It consists of two main components : the labels and the data.
<br>
• The labels are the index values assigned to each data point, while the data represents the actual values stored in the Series.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> NaN (not a number) is the standard missing data marker used in Pandas.
</div>
</div>

In [5]:
#Creating Series with a List
x = [1, 7, 2]
ser = pd.Series(x)
ser

0    1
1    7
2    2
dtype: int64

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• You can also create a Pandas Series from a key/value object such as a Python dictionary.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> The keys of the dictionary become the labels and the values of the dictionary become the data.
</div>
</div>

In [6]:
#Creating Series with a Dictionary
x = {"Math":89,"English":75,"Science":99}
grades = pd.Series(x)
grades

Math       89
English    75
Science    99
dtype: int64

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• To include only some of the dictionary's items in the Series, pass the index argument with just the keys you want.
</div>

In [7]:
#Creating Series with only selected items of the Dictionary
x = {"Math":89,"English":75,"Science":99}
grades = pd.Series(x, index=["Math","Science"])
grades

Math       89
Science    99
dtype: int64
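
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• A minimal sketch of what happens when the index argument names a label that is not a key of the dictionary. The label "History" and the variable names are made up for illustration. The missing entry is filled with NaN, the missing data marker mentioned earlier, and the values become floats.
</div>

```python
import pandas as pd

scores = {"Math": 89, "English": 75, "Science": 99}

# "History" is a hypothetical label that is not a key of the dictionary
partial = pd.Series(scores, index=["Math", "History"])

print(partial)          # the missing "History" entry is shown as NaN
print(partial.isna())   # True only where the value is missing
```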

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• You can also name your Series by passing the name parameter when creating it.
</div>

In [8]:
#Creating Series with a specific name
x = {"Math":89,"English":75,"Science":99}
grades = pd.Series(x, index=["Math","Science"], name="Grades")
grades

Math       89
Science    99
Name: Grades, dtype: int64

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• You can also create a Series from a scalar value.
<br>
• When you create a Series from a scalar value, you must provide an index.
<br>
• The value is repeated to match the length of the index.
</div>

In [9]:
#Creating Series by using a Scalar Value
ser = pd.Series(5, index=["One","Two","Three"])
ser

One      5
Two      5
Three    5
dtype: int64

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Label</strong>
<br>
• By default, the labels of a Pandas Series are integer index numbers. As with lists and strings, the numbering starts from 0.
<br>
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> If nothing else is specified, the values are labeled with their index number.
</div>
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> This label can be used to access a specified value.
</div>
</div>

In [10]:
#Accessing values of Series by Label
x = [1, 7, 2]
ser = pd.Series(x)
ser[1]

7

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Create Labels</strong>
<br>
• With the index argument, you can name your own labels.
</div>

In [11]:
#Creating user defined Labels with index parameter
x = [1, 7, 2]
ser = pd.Series(x, index=["X","Y","Z"])
ser

X    1
Y    7
Z    2
dtype: int64

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• When you have created labels, you can access an item by referring to the label.
</div>

In [12]:
#Accessing a value using a user-defined label
x = [1, 7, 2]
ser = pd.Series(x, index=["X","Y","Z"])
ser["Y"]

7
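
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Once a Series has custom labels, a value can be reached either by its label or by its integer position. The sketch below makes that choice explicit with the .loc and .iloc accessors, which are also available on a Series. The variable names by_label and by_position are illustrative only.
</div>

```python
import pandas as pd

ser = pd.Series([1, 7, 2], index=["X", "Y", "Z"])

by_label = ser.loc["Y"]      # look up by label
by_position = ser.iloc[1]    # look up by integer position (second item)

print(by_label, by_position)
```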

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas DataFrame</font>
<br>
<br>
• A DataFrame is like a table where the data is organized in rows and columns.
<br>
• It is a two-dimensional data structure like a two-dimensional array.
</div>

In [13]:
#Creating a DataFrame with Dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
       'Age': [25, 30, 35],
       'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
df

    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• You can also create a DataFrame with the help of Lists.
</div>

In [14]:
#Creating a DataFrame with Lists
data = [['John', 25, 'New York'],
       ['Alice', 30, 'London'],
       ['Bob', 35, 'Paris']]
df = pd.DataFrame(data)
df

       0   1         2
0   John  25  New York
1  Alice  30    London
2    Bob  35     Paris


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• You can name the columns of your DataFrame by passing the columns parameter when creating it.
</div>

In [15]:
#Creating a DataFrame with Column Names
data = [['John', 25, 'New York'],
       ['Alice', 30, 'London'],
       ['Bob', 35, 'Paris']]
df = pd.DataFrame(data, columns=["Name","Age","City"])
df

    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Sometimes we may want to create an empty DataFrame and add data to it later. Calling DataFrame() without arguments creates an empty DataFrame.
</div>

In [16]:
#Creating an Empty DataFrame
df = pd.DataFrame()
df



<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• You can also create a DataFrame by loading data from a CSV (Comma Separated Values) File.
<br>
• We can also create a DataFrame using other file types like JSON, Excel spreadsheet, SQL database, etc.
<br>
• The methods to read different file types are listed below :
<br>
→ <strong>JSON</strong> - read_json()
<br>
→ <strong>Excel Spreadsheet</strong> - read_excel()
<br>
→ <strong>SQL</strong> - read_sql()
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> Pandas does not print every row of a large DataFrame by default. Use the to_string() method to print the entire DataFrame.
</div>
</div>

In [17]:
#Creating a DataFrame by loading data from CSV
df = pd.read_csv("canada_per_capita_income.csv")
df

Unnamed: 0,year,per_capita_income
0,1970,3399.299037
1,1971,3768.297935
2,1972,4251.175484
3,1973,4804.463248
4,1974,5576.514583
5,1975,5998.144346
6,1976,7062.131392
7,1977,7100.12617
8,1978,7247.967035
9,1979,7602.912681


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas DataFrame Analysis</font>
<br>
<br>
• Pandas DataFrame objects come with a variety of built-in methods like head(), tail() and info() that allow us to view and analyze DataFrames.
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Pandas head()</strong>
<br>
• One of the most used methods for getting a quick overview of a DataFrame is head().
<br>
• The head() method returns the headers and a specified number of rows, starting from the top.
</div>

In [18]:
#head() returns the first 5 rows of the DataFrame by default
df.head()

Unnamed: 0,year,per_capita_income
0,1970,3399.299037
1,1971,3768.297935
2,1972,4251.175484
3,1973,4804.463248
4,1974,5576.514583


In [19]:
#Pass a number to head() to choose how many rows are returned
df.head(10)

Unnamed: 0,year,per_capita_income
0,1970,3399.299037
1,1971,3768.297935
2,1972,4251.175484
3,1973,4804.463248
4,1974,5576.514583
5,1975,5998.144346
6,1976,7062.131392
7,1977,7100.12617
8,1978,7247.967035
9,1979,7602.912681


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Pandas tail()</strong>
<br>
• There is also a tail() method for viewing the last rows of the DataFrame.
<br>
• The tail() method returns the headers and a specified number of rows, starting from the bottom.
</div>

In [20]:
#tail() returns the last 5 rows of the DataFrame by default
df.tail()

Unnamed: 0,year,per_capita_income
42,2012,42665.25597
43,2013,42676.46837
44,2014,41039.8936
45,2015,35175.18898
46,2016,34229.19363


In [21]:
#Pass a number to tail() to choose how many rows are returned
df.tail(10)

Unnamed: 0,year,per_capita_income
37,2007,36144.48122
38,2008,37446.48609
39,2009,32755.17682
40,2010,38420.52289
41,2011,42334.71121
42,2012,42665.25597
43,2013,42676.46837
44,2014,41039.8936
45,2015,35175.18898
46,2016,34229.19363




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Pandas info()</strong>
<br>
• The DataFrame object has a method called info() that prints a concise summary of the DataFrame, such as its class, column data types and memory usage.
<br>
• The info() method reports the following details about a Pandas DataFrame :
<br>
→ <strong>Class :</strong> The class of the object which indicates that it is a pandas DataFrame.
<br>
→ <strong>RangeIndex :</strong> The index range of the DataFrame showing the starting and ending index values.
<br>
→ <strong>Data columns :</strong> The total number of columns in the DataFrame.
<br>
→ <strong>Column names :</strong> The names of the columns in the DataFrame.
<br>
→ <strong>Non-Null Count :</strong> The count of non-null values for each column.
<br>
→ <strong>Dtype :</strong> The data types of the Columns.
<br>
→ <strong>Memory usage :</strong> The memory usage of the DataFrame in bytes.
</div>

In [22]:
#info() gives an overall information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   year               47 non-null     int64  
 1   per_capita_income  47 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 884.0 bytes


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas Summary Functions</font>
<br>
<br>
• In pandas, summary functions are used to quickly obtain descriptive statistics or summaries of data within a DataFrame or Series.
<br>
• These functions provide insights into the data's distribution, central tendency and other key statistical measures. 
<br>
• Here are some commonly used summary functions in Pandas :
</div>



<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
→ <strong>df.describe() :</strong> Generates descriptive statistics of numerical columns in the DataFrame such as count, mean, standard deviation, min, max and quartiles.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> You can also call describe() on a single column of your DataFrame.
</div>
</div>

In [23]:
#Description of the DataFrame
df = pd.read_csv("canada_per_capita_income.csv")
print(df.describe())

              year  per_capita_income
count    47.000000          47.000000
mean   1993.000000       18920.137063
std      13.711309       12034.679438
min    1970.000000        3399.299037
25%    1981.500000        9526.914515
50%    1993.000000       16426.725480
75%    2004.500000       27458.601420
max    2016.000000       42676.468370


In [24]:
#Description of the DataFrame for a specific column
print(df["per_capita_income"].describe())

count       47.000000
mean     18920.137063
std      12034.679438
min       3399.299037
25%       9526.914515
50%      16426.725480
75%      27458.601420
max      42676.468370
Name: per_capita_income, dtype: float64




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
→ <strong>df.shape :</strong> Returns a tuple representing the dimensionality (rows, columns) of the DataFrame.
</div>

In [25]:
#Shape of the DataFrame
print(df.shape)

(47, 2)




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
→ <strong>df.columns :</strong> Returns the column labels of the DataFrame.
</div>

In [26]:
#Columns of the DataFrame
print(df.columns)

Index(['year', 'per_capita_income'], dtype='object')




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
→ <strong>df.dtypes :</strong> Returns the data types of each column in the DataFrame.
</div>

In [27]:
#DataType of the Columns in the DataFrame
print(df.dtypes)

year                   int64
per_capita_income    float64
dtype: object




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
→ <strong>df.count() :</strong> Returns the number of non-null observations for each column.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> You can also call count() on a single column of your DataFrame.
</div>
</div>

In [28]:
#Count of the observations in all columns of the DataFrame
print(df.count())

year                 47
per_capita_income    47
dtype: int64


In [29]:
#Count of the observation for a specific column of the DataFrame
print(df["per_capita_income"].count())

47




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
→ <strong>df.isnull().sum() :</strong> Returns the number of null values in each column.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> You can also call isnull().sum() on a single column of your DataFrame.
</div>
</div>

In [30]:
#Number of null values in the DataFrame
print(df.isnull().sum())

year                 0
per_capita_income    0
dtype: int64


In [31]:
#Number of null values in the DataFrame for a particular column
print(df["per_capita_income"].isnull().sum())

0
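
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• The income data set above has no missing values, so count() and isnull().sum() are not very revealing there. The sketch below uses a small hypothetical DataFrame (the readings, temp and humidity names are invented) with one missing value per column to show how the two results complement each other.
</div>

```python
import pandas as pd

# Hypothetical data with one missing value in each column
readings = pd.DataFrame({
    "temp": [21.5, float("nan"), 19.0],
    "humidity": [40.0, 42.0, float("nan")],
})

non_null = readings.count()        # non-null values per column
nulls = readings.isnull().sum()    # null values per column

print(non_null)
print(nulls)
```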




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
→ <strong>df.value_counts() :</strong> Returns a Series containing the count of each unique row (or of each unique value when called on a Series), sorted in descending order of count.
</div>

In [32]:
#Count of each unique row in the DataFrame
print(df.value_counts())

year  per_capita_income
1970  3399.299037          1
2005  29198.055690         1
1996  16699.826680         1
1997  17310.757750         1
1998  16622.671870         1
1999  17581.024140         1
2000  18987.382410         1
2001  18601.397240         1
2002  19232.175560         1
2003  22739.426280         1
2004  25719.147150         1
2006  32738.262900         1
1994  15755.820270         1
2007  36144.481220         1
2008  37446.486090         1
2009  32755.176820         1
2010  38420.522890         1
2011  42334.711210         1
2012  42665.255970         1
2013  42676.468370         1
2014  41039.893600         1
2015  35175.188980         1
1995  16369.317250         1
1993  15875.586730         1
1971  3768.297935          1
1981  9434.390652          1
1972  4251.175484          1
1973  4804.463248          1
1974  5576.514583          1
1975  5998.144346          1
1976  7062.131392          1
1977  7100.126170          1
1978  7247.967035          1
1979  7602.912681  



<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
→ <strong>df["column"].unique() :</strong> Returns an array of the unique values found in a Series, such as a single column of a DataFrame. A DataFrame itself has no unique() method.
</div>

In [33]:
#Unique values of a specific column in a DataFrame
print(df["year"].unique())

[1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983
 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
 2012 2013 2014 2015 2016]
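
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• In the income data every value occurs exactly once, so the counts above are all 1. The following sketch uses a hypothetical Series with repeated values (the cities name and its contents are invented) to show value_counts() and unique() on data where they differ.
</div>

```python
import pandas as pd

# Hypothetical column with repeated values
cities = pd.Series(["Paris", "London", "Paris", "Tokyo", "Paris", "London"], name="city")

counts = cities.value_counts()   # most frequent value first
uniques = cities.unique()        # distinct values in order of first appearance

print(counts)
print(uniques)
```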


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas Index</font>
<br>
<br>
• In Pandas, an index refers to the labeled array that identifies rows or columns in a DataFrame or a Series. 
<br>
• We can use indexes to uniquely identify data and access data with efficiency and precision.
</div>

In [34]:
#index returns the row labels; here a RangeIndex that starts at 0 and stops before 47
df.index

RangeIndex(start=0, stop=47, step=1)

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Access Columns of a DataFrame</strong>
<br>
• We can access columns of a DataFrame using the bracket [] operator.
</div>

In [35]:
#Accessing Columns of the DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 32, 18, 47, 33],
    'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney']
}
df = pd.DataFrame(data)
df['Name']

0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• We can also access multiple columns using the [] operator.
</div>

In [36]:
#Accessing Multiple Columns of the DataFrame
df[['Name','Age']]

      Name  Age
0    Alice   25
1      Bob   32
2  Charlie   18
3    David   47
4      Eve   33




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Pandas .loc</strong>
<br>
• In Pandas, we use the .loc property to access and modify data within a DataFrame using label-based indexing.
<br>
• It allows us to select specific rows and columns based on their labels.
<br>
• The syntax of <code>.loc</code> in Pandas is : <code>df.loc[row_indexer, column_indexer]</code>
<br>
→ <strong>row_indexer -</strong> Selects rows by their labels and can be a single label, a list of labels or a boolean array.
<br>
→ <strong>column_indexer -</strong> Selects columns and can also be a single label, a list of labels or a boolean array.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> The examples below use .loc to access a single row, several rows, selected columns and rows that match a condition, always through their labels.
</div>
</div>

In [37]:
#Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 32, 18, 47, 33],
    'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney']
}
df = pd.DataFrame(data)
df

      Name  Age      City
0    Alice   25  New York
1      Bob   32     Paris
2  Charlie   18    London
3    David   47     Tokyo
4      Eve   33    Sydney


In [38]:
#Accessing a Single Row
df.loc[0]

Name       Alice
Age           25
City    New York
Name: 0, dtype: object

In [39]:
#Accessing Multiple Rows
df.loc[[0, 3, 4]]

    Name  Age      City
0  Alice   25  New York
3  David   47     Tokyo
4    Eve   33    Sydney


In [40]:
#Accessing selected columns of a single row
df.loc[0,['Name', 'Age']]

Name    Alice
Age        25
Name: 0, dtype: object

In [41]:
#Accessing all rows but only selected columns
df.loc[:,["Name"]]

      Name
0    Alice
1      Bob
2  Charlie
3    David
4      Eve


In [42]:
#Conditionally Selecting Rows from DataFrame
df.loc[df['Age'] > 30]

    Name  Age    City
1    Bob   32   Paris
3  David   47   Tokyo
4    Eve   33  Sydney
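
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• The examples above only read data. As a hedged sketch of the other uses of .loc, the code below combines two conditions with a column selection and then modifies a value through a label-based selection. The variable name people and the new age value are illustrative assumptions.
</div>

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 32, 18, 47, 33],
    "City": ["New York", "Paris", "London", "Tokyo", "Sydney"],
})

# Combine two conditions with & (each wrapped in parentheses) and pick columns
selected = people.loc[(people["Age"] > 30) & (people["City"] != "Tokyo"), ["Name", "Age"]]
print(selected)

# Modify data: set Charlie's age through a label-based selection
people.loc[people["Name"] == "Charlie", "Age"] = 19
print(people.loc[2, "Age"])
```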




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Pandas .iloc</strong>
<br>
• In Pandas, the .iloc property is used to access and modify data within a DataFrame using integer-based indexing. 
<br>
• It allows us to select specific rows and columns based on their integer locations.
<br>
• The syntax of <code>.iloc</code> in Pandas is : <code>df.iloc[row_indexer, column_indexer]</code>
<br>
→ <strong>row_indexer -</strong> Selects rows by their integer location and can be a single integer, a list of integers or a boolean array.
<br>
→ <strong>column_indexer -</strong> Selects columns and can also be a single integer, a list of integers or a boolean array.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> Unlike .loc, .iloc does not accept a boolean Series (such as df['Age'] > 30) as a row selector. Convert the condition to a plain boolean array first.
</div>
</div>

In [43]:
#Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 32, 18, 47, 33],
    'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney']
}
df = pd.DataFrame(data)
df

      Name  Age      City
0    Alice   25  New York
1      Bob   32     Paris
2  Charlie   18    London
3    David   47     Tokyo
4      Eve   33    Sydney


In [44]:
#Accessing a Single Row
df.iloc[0]

Name       Alice
Age           25
City    New York
Name: 0, dtype: object

In [45]:
#Accessing Multiple Rows
df.iloc[[0,2,4]]

      Name  Age      City
0    Alice   25  New York
2  Charlie   18    London
4      Eve   33    Sydney


In [46]:
#Accessing a single value by row and column position
df.iloc[0,2]

'New York'

In [47]:
#Accessing all rows but only selected columns
df.iloc[:,[0,2]]

      Name      City
0    Alice  New York
1      Bob     Paris
2  Charlie    London
3    David     Tokyo
4      Eve    Sydney
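
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Two further uses of .iloc, sketched under the assumption of the same sample data: slicing by position (the end of the slice is excluded) and filtering rows with a boolean array obtained from a condition via to_numpy(). The variable names people, mask, older and first_two are illustrative only.
</div>

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 32, 18, 47, 33],
    "City": ["New York", "Paris", "London", "Tokyo", "Sydney"],
})

# Position-based slice: rows 0 and 1, columns 0 and 1 (end positions excluded)
first_two = people.iloc[0:2, 0:2]
print(first_two)

# .iloc takes a boolean array, so convert the boolean Series with to_numpy()
mask = (people["Age"] > 30).to_numpy()
older = people.iloc[mask]
print(older)
```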




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Create Indexes in Pandas</font>
<br>
<br>
• Pandas offers several ways to create indexes. Some common methods are as follows :
<br>
→ Default Index
<br>
→ Setting Index
<br>
→ Creating a Range Index
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Default Index</strong>
<br>
• When we create a DataFrame or Series without specifying an index explicitly, Pandas assigns a default integer index starting from 0.
</div>

In [48]:
#Creating DataFrame with default index
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 32],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

    Name  Age      City
0   John   25  New York
1  Alice   28    London
2    Bob   32     Paris


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Setting Index</strong>
<br>
• We can set an existing column as the index using the set_index() method.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> With inplace=True, set_index() modifies the original DataFrame directly and returns None instead of creating a new object.
</div>
</div>

In [49]:
#Using Name Column as index in DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 32],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
df.set_index('Name', inplace=True)
print(df)

       Age      City
Name                
John    25  New York
Alice   28    London
Bob     32     Paris
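
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• As an alternative sketch, set_index() can be called without inplace=True, in which case it returns a new DataFrame and leaves the original unchanged. The new index can then be used for label lookups with .loc. The variable names people, indexed and alice_city are illustrative.
</div>

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Alice", "Bob"],
    "Age": [25, 28, 32],
    "City": ["New York", "London", "Paris"],
})

# Without inplace=True a new DataFrame is returned; people is left as it was
indexed = people.set_index("Name")

# The names are now row labels, so they can be used with .loc
alice_city = indexed.loc["Alice", "City"]
print(alice_city)
```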


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Creating a Range Index</strong>
<br>
• We can create a range index with specific start and stop values using pd.RangeIndex(). The stop value itself is not included.
</div>

In [50]:
#Creating a DataFrame with index defined as range
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 32],
        'City': ['New York', 'London', 'Paris']}
#stop=4 is excluded, so the labels are 1, 2 and 3
df = pd.DataFrame(data, index=pd.RangeIndex(1, 4, name='Index'))
print(df)

        Name  Age      City
Index                      
1       John   25  New York
2      Alice   28    London
3        Bob   32     Paris




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Modifying Indexes in Pandas</font>
<br>
<br>
• Pandas allows us to make changes to indexes easily. Some common modification operations are :
<br>
→ Renaming Index
<br>
→ Resetting Index
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Renaming Index</strong>
<br>
• We can rename an index using the rename() method.
</div>

In [51]:
#Renaming the index labels of the DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 32],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print('Original DataFrame ↓')
print(df)

print()

df.rename(index={0: 'A', 1: 'B', 2: 'C'}, inplace=True)
print('Modified DataFrame ↓')
print(df)

Original DataFrame ↓
    Name  Age      City
0   John   25  New York
1  Alice   28    London
2    Bob   32     Paris

Modified DataFrame ↓
    Name  Age      City
A   John   25  New York
B  Alice   28    London
C    Bob   32     Paris


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Resetting Index</strong>
<br>
• We can reset the index to the default integer index using the reset_index() method. The old index labels are moved into a column named index.
</div>

In [52]:
#Resetting the index of the DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 32],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
df.rename(index={0: 'A', 1: 'B', 2: 'C'}, inplace=True)

print('Original DataFrame ↓')
print(df)

print()

df.reset_index(inplace=True)
print('Modified DataFrame ↓')
print(df)

Original DataFrame ↓
    Name  Age      City
A   John   25  New York
B  Alice   28    London
C    Bob   32     Paris

Modified DataFrame ↓
  index   Name  Age      City
0     A   John   25  New York
1     B  Alice   28    London
2     C    Bob   32     Paris
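
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• If the old labels are not needed, reset_index() accepts drop=True to discard them instead of keeping them as a column. A minimal sketch comparing both behaviours; the variable names frame, kept and dropped are illustrative.
</div>

```python
import pandas as pd

frame = pd.DataFrame(
    {"Name": ["John", "Alice", "Bob"], "Age": [25, 28, 32]},
    index=["A", "B", "C"],
)

kept = frame.reset_index()               # old labels become an "index" column
dropped = frame.reset_index(drop=True)   # old labels are discarded

print(kept)
print(dropped)
```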




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas DataFrame Manipulation</font>
<br>
<br>
• DataFrame manipulation in Pandas involves editing and modifying existing DataFrames. 
<br>
• Some common DataFrame manipulation operations are :
<br>
→ Adding rows/columns
<br>
→ Removing rows/columns
<br>
→ Renaming rows/columns
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Add a New Column to a Pandas DataFrame</strong>
<br>
• We can add a new column to an existing Pandas DataFrame by assigning a list to a new column name. The list must contain one value per row.
</div>

In [53]:
#Adding a new column in DataFrame
data = {'Name': ['John', 'Emma', 'Michael', 'Sophia'],
        'Height': [5.5, 6.0, 5.8, 5.3],
        'Qualification': ['BSc', 'BBA', 'MBA', 'BSc']}
df = pd.DataFrame(data)
print("Original DataFrame ↓")
print(df)

print()

address = ['New York', 'London', 'Sydney', 'Toronto']
df['Address'] = address
print("Modified DataFrame ↓")
print(df)

Original DataFrame ↓
      Name  Height Qualification
0     John     5.5           BSc
1     Emma     6.0           BBA
2  Michael     5.8           MBA
3   Sophia     5.3           BSc

Modified DataFrame ↓
      Name  Height Qualification   Address
0     John     5.5           BSc  New York
1     Emma     6.0           BBA    London
2  Michael     5.8           MBA    Sydney
3   Sophia     5.3           BSc   Toronto
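
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• A new column does not have to come from a list. As a rough sketch, a single value can be assigned to every row, and assign() can build a column from existing ones without changing the original DataFrame. The column names Unit and Height_cm are made up for this example.
</div>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma'],
                   'Height': [5.5, 6.0]})

# A single value is repeated for every row
df['Unit'] = 'ft'

# assign() returns a new DataFrame and leaves df untouched
df_cm = df.assign(Height_cm=df['Height'] * 30.48)
print(df_cm)
```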


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Add a New Row to a Pandas DataFrame</strong>
<br>
• Adding rows to a DataFrame is not quite as straightforward as adding columns in Pandas.
<br>
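• We use the .loc indexer to add a new row by assigning a list of values to a label that does not exist yet. With the default integer index, len(df.index) gives the next free label.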
</div>

In [54]:
#Adding a new row in DataFrame
data = {'Name': ['John', 'Emma', 'Michael', 'Sophia'],
        'Height': [5.5, 6.0, 5.8, 5.3],
        'Qualification': ['BSc', 'BBA', 'MBA', 'BSc']}
df = pd.DataFrame(data)
print("Original DataFrame ↓")
print(df)

print()

df.loc[len(df.index)] = ['Amy', 5.2, 'BIT'] 
print("Modified DataFrame ↓")
print(df)

Original DataFrame ↓
      Name  Height Qualification
0     John     5.5           BSc
1     Emma     6.0           BBA
2  Michael     5.8           MBA
3   Sophia     5.3           BSc

Modified DataFrame ↓
      Name  Height Qualification
0     John     5.5           BSc
1     Emma     6.0           BBA
2  Michael     5.8           MBA
3   Sophia     5.3           BSc
4      Amy     5.2           BIT
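
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• When several rows need to be added at once, one possible alternative to .loc is pd.concat(). This sketch assumes the new rows are already in a second DataFrame (called new_rows here) with the same columns.
</div>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma'],
                   'Height': [5.5, 6.0]})
new_rows = pd.DataFrame({'Name': ['Amy', 'Raj'],
                         'Height': [5.2, 5.9]})

# ignore_index=True renumbers the rows of the result from 0
combined = pd.concat([df, new_rows], ignore_index=True)
print(combined)
```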


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Remove Rows/Columns from a Pandas DataFrame</strong>
<br>
• We can use drop() to delete rows and columns from a DataFrame. Passing axis=0 drops by row label and axis=1 drops by column label.
</div>

In [55]:
#Removing a row from a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Felipe', 'Rita'],
        'Age': [25, 30, 35, 40, 22, 29],
        'City': ['New York', 'London', 'Paris', 'Tokyo', 'Bogota', 'Banglore']}
df = pd.DataFrame(data)
print("Original DataFrame ↓")
print(df)

print()

df.drop(4, axis=0, inplace=True)
print("Modified DataFrame ↓")
print(df)

Original DataFrame ↓
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    David   40     Tokyo
4   Felipe   22    Bogota
5     Rita   29  Banglore

Modified DataFrame ↓
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    David   40     Tokyo
5     Rita   29  Banglore


In [56]:
#Removing a column from a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Felipe', 'Rita'],
        'Age': [25, 30, 35, 40, 22, 29],
        'City': ['New York', 'London', 'Paris', 'Tokyo', 'Bogota', 'Banglore']}
df = pd.DataFrame(data)
print("Original DataFrame ↓")
print(df)

print()

df.drop("Age", axis=1, inplace=True)
print("Modified DataFrame ↓")
print(df)

Original DataFrame ↓
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    David   40     Tokyo
4   Felipe   22    Bogota
5     Rita   29  Banglore

Modified DataFrame ↓
      Name      City
0    Alice  New York
1      Bob    London
2  Charlie     Paris
3    David     Tokyo
4   Felipe    Bogota
5     Rita  Banglore
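
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• drop() also accepts the index and columns keywords, which can take lists of labels. The sketch below removes two rows and one column in a single call and, because inplace is not used, returns a new DataFrame (named trimmed here for illustration).
</div>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['New York', 'London', 'Paris', 'Tokyo']})

# Drop rows labelled 0 and 2 and the 'Age' column; df itself is unchanged
trimmed = df.drop(index=[0, 2], columns=['Age'])
print(trimmed)
```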


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Rename Labels in a DataFrame</strong>
<br>
• We can rename column labels and row labels in a Pandas DataFrame using the rename() method, by passing a dictionary of old-to-new names to the columns or index parameter.
</div>

In [57]:
#Renaming Column Label of a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
print("Original DataFrame ↓")
print(df)

print()

df.rename(columns={'Name': 'First_Name', 'Age': 'Number', 'City':'Address'}, inplace=True)
print("Modified DataFrame ↓")
print(df)

Original DataFrame ↓
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    David   40     Tokyo

Modified DataFrame ↓
  First_Name  Number   Address
0      Alice      25  New York
1        Bob      30    London
2    Charlie      35     Paris
3      David      40     Tokyo


In [58]:
#Renaming Row Label of a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
print("Original DataFrame ↓")
print(df)

print()

df.rename(index={0: 100, 1: 200, 2: 300, 3: 400}, inplace=True)
print("Modified DataFrame ↓")
print(df)

Original DataFrame ↓
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    David   40     Tokyo

Modified DataFrame ↓
        Name  Age      City
100    Alice   25  New York
200      Bob   30    London
300  Charlie   35     Paris
400    David   40     Tokyo
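
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Besides a dictionary, rename() also accepts a function that is applied to every label. This is a small sketch with made-up column names, useful when many labels need the same cleanup.
</div>

```python
import pandas as pd

df = pd.DataFrame({'First Name': ['Alice', 'Bob'],
                   'AGE': [25, 30]})

# The function receives each column label and returns the new label
cleaned = df.rename(columns=lambda c: c.strip().lower().replace(' ', '_'))
print(cleaned)
```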


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas Select</font>
<br>
<br>
• Pandas select refers to the process of extracting specific portions of data from a DataFrame.
<br>
• Data selection involves choosing specific rows and columns based on labels, positions or conditions.
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Select Data Using Indexing and Slicing</strong>
<br>
• In Pandas, we can use square brackets with a column label (or a list of labels) to select columns, and with a slice of positions to select rows.
</div>

In [59]:
#Select Data Using Indexing and Slicing
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 27, 29],
    'Salary': [50000, 60000, 45000, 55000, 52000]
}
df = pd.DataFrame(data)
print("Selecting single column : Name")
print(df["Name"])

print()

print("Selecting multiple columns : Age and Salary")
print(df[["Age","Salary"]])

print()

print("Selecting rows from 1 to 3")
print(df[1:4])
print()

Selecting single column : Name
0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object

Selecting multiple columns : Age and Salary
   Age  Salary
0   25   50000
1   30   60000
2   22   45000
3   27   55000
4   29   52000

Selecting rows from 1 to 3
      Name  Age  Salary
1      Bob   30   60000
2  Charlie   22   45000
3    David   27   55000
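
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• To select rows and columns together, Pandas provides the .loc (label-based) and .iloc (position-based) indexers. The sketch below uses illustrative variable names; note that a .loc slice includes its end label while an .iloc slice excludes its end position.
</div>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 30, 22, 27, 29],
                   'Salary': [50000, 60000, 45000, 55000, 52000]})

# .loc uses labels: rows labelled 1 through 3 (inclusive), two named columns
by_label = df.loc[1:3, ['Name', 'Salary']]
print(by_label)

# .iloc uses positions: rows 1 and 2 (end excluded), first two columns
by_position = df.iloc[1:3, 0:2]
print(by_position)
```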



<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Select Rows Based on Specific Criteria</strong>
<br>
• In Pandas, we can use boolean conditions to filter rows based on specific criteria.
</div>

In [60]:
#Select Rows Based on Specific Criteria
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 22, 28, 24],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female']
}
df = pd.DataFrame(data)
print(df[df['Age'] > 25])

    Name  Age Gender
1    Bob   30   Male
3  David   28   Male
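
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Several conditions can be combined with & (and), | (or) and ~ (not). Each condition needs its own parentheses. A short sketch with illustrative variable names:
</div>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
                   'Age': [25, 30, 22, 28, 24],
                   'Gender': ['Female', 'Male', 'Male', 'Male', 'Female']})

# Both conditions must hold
both = df[(df['Age'] > 23) & (df['Gender'] == 'Female')]
print(both)

# At least one condition must hold
either = df[(df['Age'] < 23) | (df['Gender'] == 'Female')]
print(either)
```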


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Select Data using query()</strong>
<br>
• The query() method in Pandas allows us to select rows with a condition written as a string, much like the WHERE clause of a SQL statement.
</div>

In [61]:
#Select Data using query()
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 22, 28, 35],
    'Score': [85, 90, 75, 80, 95]
}
df = pd.DataFrame(data)
print(df.query("Score > 80"))

    Name  Age  Score
0  Alice   25     85
1    Bob   30     90
4    Eva   35     95
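
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Inside a query() string, conditions can be joined with and / or, and Python variables can be referenced with the @ prefix. The variable name min_score below is just an example.
</div>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
                   'Age': [25, 30, 22, 28, 35],
                   'Score': [85, 90, 75, 80, 95]})

min_score = 80

# @min_score refers to the Python variable defined above
result = df.query("Score > @min_score and Age < 30")
print(result)
```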


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Filter Data By Labels</strong>
<br>
• We can use the filter() function to select columns by their names or labels. It looks only at labels, never at the values stored in the DataFrame.
</div>

In [62]:
#Filter data by using filter() function
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Department': ['HR', 'Marketing', 'Marketing', 'IT'],
        'Salary': [50000, 60000, 55000, 70000]}
df = pd.DataFrame(data)
print(df.filter(items=['Name', 'Salary']))

      Name  Salary
0    Alice   50000
1      Bob   60000
2  Charlie   55000
3    David   70000
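
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Apart from items, filter() also supports the like and regex parameters for partial matches on labels. The column names in this sketch are invented for illustration.
</div>

```python
import pandas as pd

df = pd.DataFrame({'sales_2022': [1, 2],
                   'sales_2023': [3, 4],
                   'region': ['North', 'South']})

# like= keeps columns whose label contains the given text
sales_cols = df.filter(like='sales')
print(sales_cols)

# regex= keeps columns whose label matches the regular expression
cols_2023 = df.filter(regex='2023$')
print(cols_2023)
```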


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Select Rows Based on a List of Values</strong>
<br>
• Pandas provides us with the method named isin() to filter rows based on a list of values.
</div>

In [63]:
#Select Rows Based on a List of Values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 22, 28, 24]
}
df = pd.DataFrame(data)
print(df[df['Name'].isin(['Bob', 'David'])])

    Name  Age
1    Bob   30
3  David   28
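
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• To keep the rows that are NOT in the list, the isin() mask can be inverted with the ~ operator. A minimal sketch:
</div>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
                   'Age': [25, 30, 22, 28, 24]})

# ~ flips True and False, so matching names are excluded
others = df[~df['Name'].isin(['Bob', 'David'])]
print(others)
```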


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas MultiIndex</font>
<br>
<br>
• A MultiIndex in Pandas is a hierarchical indexing structure that allows us to represent and work with higher-dimensional data efficiently.
<br>
• While a typical index has a single level of labels, a MultiIndex contains multiple levels of labels.
<br>
• Each level in a MultiIndex is linked to the next one through a parent/child relationship.
</div>

In [64]:
#Creating a DataFrame
data = {
    "Continent": ["North America", "Europe", "Asia", "North America", "Asia", "Europe", "North America", "Asia", "Europe", "Asia"],
    "Country": ["United States", "Germany", "China", "Canada", "Japan", "France", "Mexico", "India", "United Kingdom", "Nepal"],
    "Population": [331002651, 83783942, 1439323776, 37742154, 126476461, 65273511, 128932753, 1380004385, 67886011, 29136808]
}
df = pd.DataFrame(data)
print(df)

       Continent         Country  Population
0  North America   United States   331002651
1         Europe         Germany    83783942
2           Asia           China  1439323776
3  North America          Canada    37742154
4           Asia           Japan   126476461
5         Europe          France    65273511
6  North America          Mexico   128932753
7           Asia           India  1380004385
8         Europe  United Kingdom    67886011
9           Asia           Nepal    29136808


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Notice the redundancy in the Continent column. North America and Europe are repeated three times each while Asia is repeated four times.
<br>
• Additionally, we have arranged the entries in a random order and used integer values as index for the rows, thus complicating the task of locating data for a particular country. 
<br>
• This task becomes tedious as the size of the data set grows.
<br>
• In situations like this, hierarchical indexing makes much more sense.
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Create MultiIndex in Pandas</strong>
<br>
• In Pandas, we achieve hierarchical indexing using the concept of MultiIndex.
</div>

In [65]:
#Creating a MultiIndex DataFrame 
data = {
    "Continent": ["North America", "Europe", "Asia", "North America", "Asia", "Europe", "North America", "Asia", "Europe", "Asia"],
    "Country": ["United States", "Germany", "China", "Canada", "Japan", "France", "Mexico", "India", "United Kingdom", "Nepal"],
    "Population": [331002651, 83783942, 1439323776, 37742154, 126476461, 65273511, 128932753, 1380004385, 67886011, 29136808]
}
df = pd.DataFrame(data)
df.sort_values('Continent', inplace=True)

df.set_index(['Continent','Country'], inplace=True)

print(df)

                              Population
Continent     Country                   
Asia          China           1439323776
              Japan            126476461
              India           1380004385
              Nepal             29136808
Europe        Germany           83783942
              France            65273511
              United Kingdom    67886011
North America United States    331002651
              Canada            37742154
              Mexico           128932753


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• In the above example, we first sorted the values in the DataFrame based on the Continent column. This groups the entries of the same continent together.
<br>
• We then created a MultiIndex by passing a list of columns as an argument to the set_index() function.
<br>
• Notice the order of the columns in the list. Continent comes first as it is the parent column and Country comes second as it is the child of Continent.
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Access Rows With MultiIndex</strong>
<br>
• We can access rows with MultiIndex by passing a label of the outermost level to .loc, or by providing the full hierarchical index in the form of a tuple.
</div>

In [66]:
#Access only Continents
df.loc['Asia']

         Population
Country            
China    1439323776
Japan     126476461
India    1380004385
Nepal      29136808


In [67]:
#Access Continents and Countries
df.loc[('Asia','India')]

Population    1380004385
Name: (Asia, India), dtype: int64
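
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Selecting by an inner level alone (for example, only by Country) is not covered by .loc with a single label. One possible approach is the xs() method, sketched below on a smaller version of the data. reset_index() turns the levels back into ordinary columns.
</div>

```python
import pandas as pd

df = pd.DataFrame({"Continent": ["Asia", "Europe", "Asia", "Europe"],
                   "Country": ["China", "Germany", "Japan", "France"],
                   "Population": [1439323776, 83783942, 126476461, 65273511]})

# sort_index() orders the labels of both levels
df = df.set_index(['Continent', 'Country']).sort_index()

# xs() takes a cross-section on the named level
japan = df.xs('Japan', level='Country')
print(japan)

# Move both index levels back into regular columns
flat = df.reset_index()
print(flat)
```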

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas Reshape</font>
<br>
<br>
• In Pandas, reshaping data refers to the process of converting a DataFrame from one format to another for better data visualization and analysis.
</div>

In [68]:
#Reshape Data Using pivot()
data = {'Date': ['2023-08-01', '2023-08-01', '2023-08-02', '2023-08-02'],
        'Category': ['A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 40]}
df = pd.DataFrame(data)
print("Original Dataframe ↓")
print(df)

print()

pivot_df = df.pivot(index='Date', columns='Category', values='Value')
print("Reshaped DataFrame ↓")
print(pivot_df)

Original Dataframe ↓
         Date Category  Value
0  2023-08-01        A     10
1  2023-08-01        B     20
2  2023-08-02        A     30
3  2023-08-02        B     40

Reshaped DataFrame ↓
Category     A   B
Date              
2023-08-01  10  20
2023-08-02  30  40




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Reshape Data Using pivot()</strong>
<br>
• In Pandas, the pivot() function reshapes data based on column values.
<br>
• It takes simple column-wise data as input and groups the entries into a two-dimensional table.
<br>
• The pivot() function takes the parameters index, columns and values :
<br>
→ <strong>index :</strong> specifies the column to be used as the index for the pivoted DataFrame.
<br>
→ <strong>columns :</strong> specifies the column whose unique values will become the new column headers.
<br>
→ <strong>values :</strong> specifies the column containing the values to be placed in the new columns.
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>pivot() for Multiple Values</strong>
<br>
• If we omit the values argument in pivot(), it uses all the remaining columns (besides the ones specified as index and columns) as values for the pivot table.
</div>

In [69]:
#pivot() for multiple values
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, 75, 30, 77],
        'Humidity': [80, 10, 85, 5]}
df = pd.DataFrame(data)
print('Original DataFrame ↓')
print(df)

print()

pivot_df = df.pivot(index='Date', columns='City')
print('Reshaped DataFrame ↓')
print(pivot_df)

Original DataFrame ↓
         Date         City  Temperature  Humidity
0  2023-01-01     New York           32        80
1  2023-01-01  Los Angeles           75        10
2  2023-01-02     New York           30        85
3  2023-01-02  Los Angeles           77         5

Reshaped DataFrame ↓
           Temperature             Humidity         
City       Los Angeles New York Los Angeles New York
Date                                                
2023-01-01          75       32          10       80
2023-01-02          77       30           5       85


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Reshape Data Using pivot_table()</strong>
<br>
• The pivot_table() function in Pandas is a way to reshape and summarize data in a DataFrame.
<br>
• It allows us to create a pivot table that aggregates and summarizes data based on the specified index, columns and aggregation functions.
</div>

In [70]:
#Reshape Data Using pivot_table()
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)
print("Original Dataframe ↓")
print(df)

print()

pivot_table_df = df.pivot_table(index='Category', values='Value', aggfunc='mean')
print("Reshaped Dataframe ↓")
print(pivot_table_df)

Original Dataframe ↓
  Category  Value
0        A     10
1        B     20
2        A     30
3        B     40
4        A     50
5        B     60

Reshaped Dataframe ↓
          Value
Category       
A          30.0
B          40.0




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>pivot_table() Syntax</strong>
<br>
• The syntax of pivot_table() in Pandas is :
<br>
<code>df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, dropna=True)</code>
<br>
→ <strong>index</strong> : the column to use as row labels.
<br>
→ <strong>columns</strong> : the column whose unique values become the new column headers.
<br>
→ <strong>values</strong> : the column to use for the new DataFrame values.
<br>
→ <strong>aggfunc</strong> : the function to use for aggregation. It defaults to 'mean'; other options include 'sum', 'count', 'max' and 'min'.
<br>
→ <strong>fill_value</strong> : value to replace missing values with.
<br>
→ <strong>dropna</strong> : whether to exclude the columns whose entries are all NaN.
</div>
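
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• The sketch below combines several of these parameters on made-up sales data. The pair (East, A) occurs twice, so pivot() would raise an error here, while pivot_table() adds the two values together. The missing combination (West, B) is filled with 0.
</div>

```python
import pandas as pd

df = pd.DataFrame({'Region': ['East', 'East', 'West', 'East'],
                   'Product': ['A', 'B', 'A', 'A'],
                   'Sales': [100, 150, 200, 50]})

# Sum the sales for every Region/Product pair; empty cells become 0
table = df.pivot_table(index='Region', columns='Product',
                       values='Sales', aggfunc='sum', fill_value=0)
print(table)
```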

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas CSV</font>
<br>
<br>
• Pandas provides functions for both reading from and writing to CSV files.
<br>
• CSV stands for Comma-Separated Values. 
<br>
• It is a popular file format used for storing tabular data, where each row represents a record and columns are separated by a delimiter (generally a comma).
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Read CSV Files</strong>
<br>
• In Pandas, the read_csv() function allows us to read data from a CSV file into a DataFrame. 
<br>
• By default, it treats the comma as the delimiter and uses the first line of the file as the column names.
</div>

In [71]:
#Reading a CSV File
df = pd.read_csv("canada_per_capita_income.csv")
df.head()

   year  per_capita_income
0  1970        3399.299037
1  1971        3768.297935
2  1972        4251.175484
3  1973        4804.463248
4  1974        5576.514583




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>read_csv() Syntax</strong>
<br>
• The syntax of read_csv() in Pandas is :
<br>
<code>df = pd.read_csv(
    filepath_or_buffer,
    sep=',',
    header=0,
    names=['col1', 'col2', 'col3'],
    index_col='col1',
    usecols=['col1', 'col3'],
    skiprows=[1, 3],
    nrows=100,
    skipinitialspace=True
)
</code>
→ <strong>filepath_or_buffer :</strong> represents the path or buffer object containing the CSV data to be read.
<br>
→ <strong>sep(optional) :</strong> specifies the delimiter used in the CSV file.
<br>
→ <strong>header(optional) :</strong> indicates the row number to be used as the header or column names.
<br>
→ <strong>names(optional) :</strong> a list of column names to assign to the DataFrame.
<br>
→ <strong>index_col(optional) :</strong> specifies the column to be used as the index of the DataFrame.
<br>
→ <strong>usecols(optional) :</strong> a list of columns to be read and included in the DataFrame.
<br>
→ <strong>skiprows(optional) :</strong> used to skip specific rows while reading the CSV file.
<br>
→ <strong>nrows(optional) :</strong> sets the maximum number of rows to be read from the CSV file.
<br>
→ <strong>skipinitialspace(optional) :</strong> determines whether to skip any whitespace after the delimiter in each field.
</div>
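
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• To try some of these parameters without a file on disk, the CSV text can be wrapped in io.StringIO, which read_csv() accepts as a buffer. The data and column names below are invented for this sketch.
</div>

```python
import io
import pandas as pd

# Semicolon-separated text standing in for a file
csv_text = "id;name;score\n1;Alice;85\n2;Bob;90\n3;Charlie;75\n"

df = pd.read_csv(io.StringIO(csv_text),
                 sep=';',                  # delimiter used in the text
                 usecols=['id', 'score'],  # read only these columns
                 index_col='id',           # use 'id' as the row index
                 nrows=2)                  # read at most two data rows
print(df)
```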

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Write to CSV Files</strong>
<br>
• We used read_csv() to read data from a CSV file into a DataFrame.
<br>
• Pandas also provides the to_csv() function to write data from a DataFrame into a CSV file.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> The index=False parameter is used to exclude the index labels from the CSV file.
</div>
</div>

In [72]:
#Writing a CSV File
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)



<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>to_csv() Syntax</strong>
<br>
• The syntax of to_csv() in Pandas is :
<br>
<code>df.to_csv(
    path_or_buf,
    sep=',',
    header=True,
    index=False,
    mode='w',
    encoding=None,
    quoting=None,
    lineterminator='\n',
)
</code>
→ <strong>path_or_buf :</strong> represents the path or buffer object where the DataFrame will be saved as a CSV file.
<br>
→ <strong>sep(optional) :</strong> specifies the delimiter to be used in the output CSV file.
<br>
→ <strong>header(optional) :</strong> indicates whether to include the header row in the output CSV file.
<br>
→ <strong>index(optional) :</strong> determines whether to include the index column in the output CSV file.
<br>
→ <strong>mode(optional) :</strong> specifies the mode in which the output file will be opened.
<br>
→ <strong>encoding(optional) :</strong> sets the character encoding to be used when writing the CSV file.
<br>
→ <strong>quoting(optional) :</strong> determines the quoting behavior for fields that contain special characters.
<br>
→ <strong>lineterminator(optional) :</strong> specifies the character sequence used to terminate lines in the output CSV file (this parameter was named line_terminator before Pandas 1.5).
</div>
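
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• When no path is given, to_csv() returns the CSV text as a string instead of writing a file. This makes it easy to check the effect of a parameter, as in this small sketch.
</div>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice'],
                   'Age': [25, 30]})

# No path: the CSV content is returned as a string
csv_text = df.to_csv(index=False, sep=';')
print(csv_text)
```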



<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas Cleaning Data</font>
<br>
<br>
• Data cleaning means fixing and organizing messy data. 
<br>
• Pandas offers a wide range of tools and functions to help us clean and preprocess our data effectively.
<br>
• Data cleaning often involves :
<br>
→ Dropping irrelevant columns.
<br>
→ Renaming column names to meaningful names.
<br>
→ Making data values consistent.
<br>
→ Replacing or filling in missing values.
</div>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Drop Rows With Missing Values</strong>
<br>
• In Pandas, we can drop rows with missing values using the dropna() function.
</div>

In [73]:
#Dropping the NaN Values from the DataFrame
data = {
    'A': [1, 2, 3, None, 5],  
    'B': [None, 2, 3, 4, 5],  
    'C': [1, 2, None, None, 5]
}
df = pd.DataFrame(data)
print("Original Data ↓")
print(df)

print()

df_cleaned = df.dropna()
print("Cleaned Data  ↓")
print(df_cleaned)

Original Data ↓
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0

Cleaned Data  ↓
     A    B    C
1  2.0  2.0  2.0
4  5.0  5.0  5.0


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Fill Missing Values</strong>
<br>
• To fill the missing values in Pandas, we use the fillna() function.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> When filling with an aggregate value in fillna(), remember that mode() returns a Series (there can be more than one most frequent value), so use <code>mode()[0]</code> to get the first most frequent value.
</div>
</div>

In [74]:
#Filling the missing Values
data = {
    'A': [1, 2, 3, None, 5],  
    'B': [None, 2, 3, 4, 5],  
    'C': [1, 2, None, None, 5]
}
df = pd.DataFrame(data)
print("Original Data ↓")
print(df)

print()

df.fillna(0, inplace=True)
print("Data after filling NaN with 0 ↓")
print(df)

Original Data ↓
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0

Data after filling NaN with 0 ↓
     A    B    C
0  1.0  0.0  1.0
1  2.0  2.0  2.0
2  3.0  3.0  0.0
3  0.0  4.0  0.0
4  5.0  5.0  5.0


In [75]:
#Filling the missing Values with aggregate functions
data = {
    'A': [1, 2, 3, None, 5],  
    'B': [None, 2, 3, 4, 5],  
    'C': [1, 2, None, None, 5]
}
df = pd.DataFrame(data)
print("Original Data ↓")
print(df)

print()

df.fillna(df.mean(), inplace=True)
print("Data after filling NaN with column mean ↓")
print(df)

Original Data ↓
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0

Data after filling NaN with column mean ↓
      A    B         C
0  1.00  3.5  1.000000
1  2.00  2.0  2.000000
2  3.00  3.0  2.666667
3  2.75  4.0  2.666667
4  5.00  5.0  5.000000


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Pandas Handling Duplicate Values</strong>
<br>
• In large datasets, we often encounter duplicate entries in tables. These duplicate entries can throw off our analysis and skew the results.
<br>
• Pandas provides several methods to find and remove duplicate entries in DataFrames.
<br>
<br>
<strong>Find Duplicate Entries</strong>
<br>
• We can find duplicate entries in a DataFrame using the duplicated() method. It returns True for a row that repeats an earlier row and False otherwise, so the first occurrence is not marked as a duplicate.
</div>

In [76]:
#Finding Duplicate Entries
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df.duplicated())

0    False
1    False
2     True
3     True
4    False
dtype: bool


In [77]:
#Checking for duplicate entries in Name and Age Columns
data = {
    'Name': ['John', 'Anna', 'Johnny', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Las Vegas', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df.duplicated(subset=['Name', 'Age']))

0    False
1    False
2    False
3     True
4    False
dtype: bool


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Remove Duplicate Entries</strong>
<br>
• We can remove duplicate entries in Pandas using the drop_duplicates() method.
</div>

In [78]:
#Removing Duplicate Entries from DataFrame
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
df.drop_duplicates(inplace=True)
print(df)

   Name  Age         City
0  John   28     New York
1  Anna   24  Los Angeles
4  John   19      Chicago
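
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Like duplicated(), drop_duplicates() accepts a subset of columns, and its keep parameter decides which occurrence survives. This sketch keeps the last row for each name; the variable name latest is illustrative.
</div>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'John'],
                   'Age': [28, 24, 19],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# Compare only the Name column and keep the last occurrence of each name
latest = df.drop_duplicates(subset=['Name'], keep='last').reset_index(drop=True)
print(latest)
```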


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Rename Column Names to Meaningful Names</strong>
<br>
• To rename column names to more meaningful names in Pandas, we can use the rename() function. 
</div>

In [79]:
#Renaming Column Names
data = {
    'A': [25, 30, 35],
    'B': ['John', 'Doe', 'Smith'],
    'C': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print("Original DataFrame ↓")
print(df)

print()

df.rename(columns={'A': 'Age', 'B': 'Name', 'C': 'Salary'}, inplace=True)
print("Modified DataFrame ↓")
print(df)

Original DataFrame ↓
    A      B      C
0  25   John  50000
1  30    Doe  60000
2  35  Smith  70000

Modified DataFrame ↓
   Age   Name  Salary
0   25   John   50000
1   30    Doe   60000
2   35  Smith   70000


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas Handling Wrong Format Data</font>
<br>
<br>
• In real-world scenarios, data is collected from various sources, which leads to inconsistencies in the format of the data.
<br>
• Such inconsistencies can create challenges, making data analysis difficult or even impossible.
<br>
<br>
<strong>Convert Data to Correct Format</strong>
<br>
• We can remove inconsistencies in data by converting a column with inconsistencies to a specific format.
</div>

In [80]:
#Converting data to a correct format
data = {
    'Country': ['USA', 'Canada', 'Australia', 'Germany', 'Japan'],
    'Date': ['2023-07-20', '2023-07-21', '2023-07-22', '2023-07-23', '2023-07-24'],
    'Temperature': [25.5, '28.0', 30.2, 22.8, 26.3]
}
df = pd.DataFrame(data)

df['Temperature'] = df['Temperature'].astype(float)
mean_temperature = df['Temperature'].mean()

print(mean_temperature)

26.560000000000002
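
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• astype(float) raises an error if a value cannot be converted at all. As a hedged alternative, pd.to_numeric() with errors='coerce' turns such values into NaN, which can then be dropped or filled. The value 'unknown' below is invented to show this.
</div>

```python
import pandas as pd

df = pd.DataFrame({'Temperature': [25.5, '28.0', 'unknown', 22.8]})

# Values that cannot be parsed as numbers become NaN instead of raising an error
df['Temperature'] = pd.to_numeric(df['Temperature'], errors='coerce')
print(df)
```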


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Data Analysis and Aggregation</font>
<br>
<br>
<strong>Pandas DateTime</strong>
<br>
• In Pandas, DateTime is a data type that represents a single point in time.
<br>
• It is especially useful when dealing with time-series data like stock prices, weather records, economic indicators etc.
<br>
• We use the to_datetime() function to convert strings to DateTime objects.
<br>
• Dates can be represented in various formats such as mm-dd-yyyy, dd-mm-yyyy, yyyy-mm-dd etc.
<br>
• Also, different separators such as /, -, . etc can be used.
<br>
• We can handle these variations by converting the column containing dates to the DateTime format, passing options such as dayfirst or format where needed.
</div>

In [81]:
#Converting string to DateTime Format
date_string = '2001 12 24 12:38'
print("String :", date_string)
print(type(date_string))

print()

date = pd.to_datetime(date_string)
print("DateTime :", date)
print(type(date))

String : 2001 12 24 12:38
<class 'str'>

DateTime : 2001-12-24 12:38:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [82]:
#to_datetime() With Day First Format
df = pd.DataFrame({'date': ['13-02-2021', '22-03-2022', '30-04-2023']})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
print(df)

        date
0 2021-02-13
1 2022-03-22
2 2023-04-30


In [83]:
#to_datetime() With Custom Format
df = pd.DataFrame({'date': ['2021/22/01', '2022/13/01', '2023/30/03']})
df['date'] = pd.to_datetime(df['date'], format='%Y/%d/%m')
print(df)

        date
0 2021-01-22
1 2022-01-13
2 2023-03-30


In [84]:
#Get DateTime From Multiple Columns
df = pd.DataFrame({'year': [2021, 2022, 2023],
                   'month': [1, 2, 3],
                   'day': [1, 2, 3],
                   'hour': [10, 11, 12],
                   'minute': [30, 45, 0],
                   'second': [0, 0, 0]})
df['datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])
print(df)

   year  month  day  hour  minute  second            datetime
0  2021      1    1    10      30       0 2021-01-01 10:30:00
1  2022      2    2    11      45       0 2022-02-02 11:45:00
2  2023      3    3    12       0       0 2023-03-03 12:00:00


In [85]:
#Get Year, Month and Day From DateTime
df = pd.DataFrame({'datetime': ['2021-01-11', '2022-02-22', '2023-03-31']})

df['datetime'] = pd.to_datetime(df['datetime'])

df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month
df['day'] = df['datetime'].dt.day

print(df)

    datetime  year  month  day
0 2021-01-11  2021      1   11
1 2022-02-22  2022      2   22
2 2023-03-31  2023      3   31


In [86]:
#Get Day of Week, Week of Year and Leap Year
df = pd.DataFrame({'datetime': ['2021-01-01', '2024-02-02', '2023-03-03']})

df['datetime'] = pd.to_datetime(df['datetime'])

df['day_of_week'] = df['datetime'].dt.day_name()

df['week_of_year'] = df['datetime'].dt.isocalendar().week

df['leap_year'] = df['datetime'].dt.is_leap_year

print(df)

    datetime day_of_week  week_of_year  leap_year
0 2021-01-01      Friday            53      False
1 2024-02-02      Friday             5       True
2 2023-03-03      Friday             9      False


In [87]:
#Converting the Date Column with mixed formats of Dates
df = pd.DataFrame({'date': ['2022-12-01', '01/02/2022', '2022-03-23', '03/02/2022', '3 4 2023', '2023.9.30']})
df['date'] = pd.to_datetime(df['date'], format='mixed', dayfirst=True)
print(df)

        date
0 2022-12-01
1 2022-02-01
2 2022-03-23
3 2022-02-03
4 2023-04-03
5 2023-09-30




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Pandas Aggregate Function</strong>
<br>
• The aggregate function in Pandas performs summary computations on data. It is often applied to grouped data, but it can also be used directly on DataFrame and Series objects.
<br>
• This can be really useful for tasks such as calculating mean, sum, count and other statistics for different groups within our data.
<br>
<br>
<strong>Syntax</strong>
<br>
• Here's the basic syntax of the aggregate function : <code>df.aggregate(func, axis=0/1)</code>
<br>
→ <strong>func -</strong> an aggregate function like sum, mean, etc.
<br>
→ <strong>axis -</strong> specifies the direction of the aggregation : axis=0 (the default) applies the function to each column, axis=1 applies it to each row.
</div>
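
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• A short sketch of the <code>func</code> and <code>axis</code> parameters described above. The column names and values are hypothetical sample data.
</div>

```python
import pandas as pd

# Hypothetical sample data: two numeric columns.
df = pd.DataFrame({'Math': [80, 90, 70],
                   'Science': [85, 75, 95]})

# axis=0 (the default) applies the function to each column.
column_totals = df.aggregate('sum', axis=0)
print(column_totals)

# axis=1 applies the function to each row.
row_totals = df.aggregate('sum', axis=1)
print(row_totals)

# A list of function names returns one row per function.
summary = df.aggregate(['sum', 'mean'])
print(summary)
```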

In [88]:
#Aggregate Function applied on a DataFrame column (Series)
data = {
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35]
}
df = pd.DataFrame(data)

total_sum = df['Value'].aggregate('sum')
print("Total Sum :", total_sum)

average_value = df['Value'].aggregate('mean')
print("Average Value :", average_value)

max_value = df['Value'].aggregate('max')
print("Maximum Value :", max_value)

Total Sum : 135
Average Value : 22.5
Maximum Value : 35


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Pandas groupby</strong>
<br>
• In Pandas, the groupby operation lets us group data based on specific columns.
<br>
• This means we can divide a DataFrame into smaller groups based on the values in these columns.
<br>
• Once grouped, we can then apply functions to each group separately. These functions help summarize or aggregate the data in each group.
</div>
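
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• The split and apply steps described above can be sketched as follows, assuming hypothetical sales data : the loop shows the rows in each group and agg() then computes several summaries per group.
</div>

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
                   'Sales': [1000, 500, 800, 300]})

# Split: one group per distinct Category value.
groups = df.groupby('Category')
for name, group in groups:
    print(name, group['Sales'].tolist())

# Apply and combine: several summaries per group in one call.
summary = groups['Sales'].agg(['sum', 'mean', 'count'])
print(summary)
```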

In [89]:
#Grouping the Data based on a specific column and applying sum()
data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
        'Sales': [1000, 500, 800, 300]}
df = pd.DataFrame(data)
grouped = df.groupby('Category')['Sales'].sum()
print(grouped)

Category
Clothing        800
Electronics    1800
Name: Sales, dtype: int64


In [90]:
#Grouping by multiple columns
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Grade': ['A', 'B', 'A', 'A', 'B'],
    'Score': [90, 85, 92, 88, 78]
}
df = pd.DataFrame(data)

grouped = df.groupby(['Gender', 'Grade']).aggregate(['sum','mean'])

print(grouped)

             Score      
               sum  mean
Gender Grade            
Female A        88  88.0
       B        85  85.0
Male   A       182  91.0
       B        78  78.0


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas Sort</font>
<br>
<br>
• Sorting is a fundamental operation in data manipulation and analysis that involves arranging data in a specific order.
<br>
• Sorting is crucial for tasks such as organizing data for better readability, identifying patterns, making comparisons and facilitating further analysis.
<br>
<br>
<strong>Sort DataFrame in Pandas</strong>
<br>
• In Pandas, we can use the sort_values() function to sort a DataFrame.
</div>

In [91]:
#Sort DataFrame according to a specific column in ascending order
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [28, 22, 25]}
df = pd.DataFrame(data)

sorted_df = df.sort_values(by='Age')

print(sorted_df)

      Name  Age
1      Bob   22
2  Charlie   25
0    Alice   28


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• To sort values in descending order, we set the ascending parameter to False :
</div>

In [92]:
#Sort DataFrame according to a specific column in descending order
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [28, 22, 25]}
df = pd.DataFrame(data)

sorted_df = df.sort_values(by='Age', ascending=False)

print(sorted_df)

      Name  Age
0    Alice   28
2  Charlie   25
1      Bob   22


In [93]:
#Sort Pandas DataFrame by Multiple Columns
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 22, 30, 22],
        'Score': [85, 90, 75, 80]}
df = pd.DataFrame(data)

df1 = df.sort_values(by=['Age', 'Score'], ascending=[True,True])
print("Sorting by 'Age' and then by 'Score' ↓")
print(df1)

print()

df2 = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print("Sorting by 'Age' (ascending) and then by 'Score' (descending) ↓")
print(df2)

Sorting by 'Age' and then by 'Score' ↓
      Name  Age  Score
3    David   22     80
1      Bob   22     90
0    Alice   25     85
2  Charlie   30     75

Sorting by 'Age' (ascending) and then by 'Score' (descending) ↓
      Name  Age  Score
1      Bob   22     90
3    David   22     80
0    Alice   25     85
2  Charlie   30     75


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Sort Pandas DataFrame Using sort_index()</strong>
<br>
• The sort_index() function is used to sort a DataFrame or Series by its index. 
<br>
• This is useful for organizing data in a logical order, improving query performance and ensuring consistent data representation.
</div>
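
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Since sort_index() also works on a Series, here is a small sketch with a hypothetical Series. It also shows the <code>ascending</code> parameter, which reverses the index order.
</div>

```python
import pandas as pd

# Hypothetical Series with an unordered integer index.
ages = pd.Series([28, 22, 25], index=[2, 0, 1], name='Age')

# Ascending index order (the default).
ascending = ages.sort_index()
print(ascending)

# Descending index order.
descending = ages.sort_index(ascending=False)
print(descending)
```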

In [94]:
#Sorting DataFrame by using sort_index()
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [28, 22, 25]}
df = pd.DataFrame(data, index=[2, 0, 1])

print("Original DataFrame ↓")
print(df.to_string(index=True))

print()

sorted_df = df.sort_index()
print("Sorted DataFrame by index ↓")
print(sorted_df.to_string(index=True))

Original DataFrame ↓
      Name  Age
2    Alice   28
0      Bob   22
1  Charlie   25

Sorted DataFrame by index ↓
      Name  Age
0      Bob   22
1  Charlie   25
2    Alice   28


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Pandas Correlation</font>
<br>
<br>
• Correlation is a statistical concept that quantifies the degree to which two variables are related to each other.
<br>
• Correlation can be calculated in Pandas using the corr() function.
<div style="background-color:#ADD8E6; padding:8px; border:1px solid #87CEEB; border-radius:4px;">
<strong>Note :</strong> A DataFrame may contain missing values (NaN). The corr() function excludes them pairwise : for each pair of columns, rows where either value is NaN are ignored.
</div>
</div>
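
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• A minimal sketch of how corr() treats missing values, assuming hypothetical data with one NaN : the coefficient matches the one obtained after dropping the incomplete row by hand.
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing sales value.
df = pd.DataFrame({
    'Temperature': [22, 25, 32, 28, 30],
    'Ice_Cream_Sales': [105, 120, np.nan, 130, 125],
})

# corr() skips the row where Ice_Cream_Sales is NaN.
with_nan = df['Temperature'].corr(df['Ice_Cream_Sales'])

# Dropping that row by hand gives the same coefficient.
clean = df.dropna()
without_nan = clean['Temperature'].corr(clean['Ice_Cream_Sales'])

print(with_nan)
print(without_nan)
```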

In [95]:
#Correlation of a DataFrame
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}
df = pd.DataFrame(data)
print(df.corr())

                 Temperature  Ice_Cream_Sales
Temperature         1.000000         0.923401
Ice_Cream_Sales     0.923401         1.000000


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Positive and Negative Correlation</strong>
<br>
• Positive correlation refers to a relationship between two variables where they both tend to change in the same direction. 
<br>
• When one variable increases, the other variable also tends to increase and when one variable decreases, the other variable also tends to decrease.
<br>
• Negative correlation, on the other hand, refers to a relationship between two variables where they tend to change in opposite directions.
<br>
• When one variable increases, the other variable tends to decrease and vice versa.
</div>
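
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• To illustrate a negative correlation, here is a small sketch with hypothetical data in which sales fall as the temperature rises. The column name Hot_Chocolate_Sales and its values are made up for this example.
</div>

```python
import pandas as pd

# Hypothetical data: hot chocolate sales fall as temperature rises.
df = pd.DataFrame({
    'Temperature': [22, 25, 32, 28, 30],
    'Hot_Chocolate_Sales': [90, 80, 50, 65, 60],
})

correlation = df['Temperature'].corr(df['Hot_Chocolate_Sales'])
print(correlation)

# A coefficient below zero means the two columns move in opposite directions.
print(correlation < 0)
```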

In [96]:
#Correlation between 2 Columns
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}
df = pd.DataFrame(data)
correlation = df['Temperature'].corr(df["Ice_Cream_Sales"])
print(correlation)

0.9234007664064656




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Correlation Methods in Pandas</strong>
<br>
• We can calculate correlation using three different methods in Pandas :
<br>
→ <strong>Pearson Method :</strong> evaluates the linear relationship between two continuous variables.
<br>
→ <strong>Kendall Method :</strong> measures the ordinal association between two measured quantities.
<br>
→ <strong>Spearman Method :</strong> evaluates the monotonic relationship between two continuous or ordinal variables.
<br>
• By default, corr() computes the Pearson correlation coefficient which measures the linear relationship between two variables.
</div>

In [97]:
#Different correlation methods in Pandas
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}
df = pd.DataFrame(data)

pearson = df['Temperature'].corr(df["Ice_Cream_Sales"])
kendall = df['Temperature'].corr(df["Ice_Cream_Sales"], method='kendall')
spearman = df['Temperature'].corr(df["Ice_Cream_Sales"], method='spearman')

print(f"Pearson's Coefficient : {pearson}")
print(f"Kendall's Coefficient : {kendall}")
print(f"Spearman's Coefficient : {spearman}")

Pearson's Coefficient : 0.9234007664064656
Kendall's Coefficient : 0.7999999999999999
Spearman's Coefficient : 0.8999999999999998
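
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• A sketch of how the three methods can differ, using hypothetical data where y always rises with x but not along a straight line. Here the methods are passed to DataFrame.corr() through its <code>method</code> parameter.
</div>

```python
import pandas as pd

# Hypothetical data: y = x cubed, so y is monotonic in x but not linear.
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['y'] = df['x'] ** 3

pearson = df.corr(method='pearson').loc['x', 'y']    # linear relationship
kendall = df.corr(method='kendall').loc['x', 'y']    # ordinal association
spearman = df.corr(method='spearman').loc['x', 'y']  # monotonic relationship

# Kendall and Spearman depend only on the ordering of the values, so they
# reach 1 here, while Pearson stays below 1 because the curve is not a line.
print(f"Pearson's Coefficient : {pearson}")
print(f"Kendall's Coefficient : {kendall}")
print(f"Spearman's Coefficient : {spearman}")
```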




<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Perfect, Good & Bad Correlation</strong>
<br>
→ <strong>Perfect Correlation</strong>
<br>
• A perfect positive correlation, indicated by a coefficient of +1, implies that every increase in one variable is matched by a proportionate increase in the other.
<br>
• A perfect negative correlation, indicated by a coefficient of -1, implies that every increase in one variable is matched by a proportionate decrease in the other.
<br>
→ <strong>Good Correlation</strong>
<br>
• As a rule of thumb, a coefficient with an absolute value between roughly 0.5 and 0.9 is often treated as a good correlation. It indicates a fairly strong relationship between the variables, but not a perfect one.
<br>
→ <strong>Bad Correlation</strong>
<br>
• A bad correlation is typically close to zero, indicating little or no linear relationship between the two variables. The variables may still be related in a non-linear way.
</div>
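
<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• A small sketch of the extreme cases, using hypothetical values. Note that a Pearson coefficient near zero only rules out a linear relationship : in the last case y is fully determined by x, yet the coefficient is zero.
</div>

```python
import pandas as pd

# Hypothetical values, symmetric around zero.
x = pd.Series([-2, -1, 0, 1, 2])

perfect_positive = x.corr(2 * x + 1)   # y rises in step with x
perfect_negative = x.corr(-3 * x + 5)  # y falls in step with x
zero_linear = x.corr(x ** 2)           # y depends on x, but not linearly

print(perfect_positive)  # close to +1
print(perfect_negative)  # close to -1
print(zero_linear)       # close to 0
```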

---

<p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:25px;">Thanks 👏 for Visiting!</p>