**Pandas** is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.

**Installation of Pandas**

In [156]:
!pip install pandas



**Import Pandas**

In [157]:
import pandas as pd  # import pandas under its conventional alias, pd

**Series**

A **Pandas Series** is like a column in a table.

It is a *one-dimensional array* holding data of any type.

In [158]:
a = [1, 7, 2, 4, 9, 8]

myNum = pd.Series(a)

print(myNum)

0    1
1    7
2    2
3    4
4    9
5    8
dtype: int64


In [159]:
print(myNum[0])
print(myNum[5])

1
8


**Labels**

With the *index* argument, you can name your own labels.

In [160]:
a = [10, 17, 21]

myNum = pd.Series(a, index = ["a", "b", "c"])

print(myNum)

a    10
b    17
c    21
dtype: int64


In [161]:
print(myNum["a"])

10


**Key/Value Objects as Series**

In [162]:
running = {"day1": 2, "day2": 3, "day3": 5}

myRun = pd.Series(running)

print(myRun)

day1    2
day2    3
day3    5
dtype: int64
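
When a Series is built from a dictionary, the keys become the labels. As a small sketch (the variable name `first_two` is only illustrative), passing the *index* argument together with the dictionary should keep just the listed keys, in the order given:

```python
import pandas as pd

running = {"day1": 2, "day2": 3, "day3": 5}

# Passing index= alongside a dict keeps only the listed keys, in that order
first_two = pd.Series(running, index=["day1", "day2"])

print(first_two)
```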


**DataFrames**

Data sets in Pandas are usually *two-dimensional* tables, called **DataFrames**.

A Series is like a single column; a DataFrame is the whole table.

In [163]:
data = {
  "kilometers": [4, 3, 5],
  "duration": [50, 40, 45]
}

myRun = pd.DataFrame(data)

print(myRun)

   kilometers  duration
0           4        50
1           3        40
2           5        45
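
To make the column/table relationship concrete, the sketch below rebuilds the same small table (under the illustrative name `runs`) and pulls out one column and one row. Both come back as a Series:

```python
import pandas as pd

runs = pd.DataFrame({"kilometers": [4, 3, 5], "duration": [50, 40, 45]})

km = runs["kilometers"]   # one column -> a Series
first_row = runs.loc[0]   # one row, selected by its index label -> also a Series

print(type(km).__name__)
print(first_row)
```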


In [164]:
print(myRun.info())       # Overview of the dataset
print(myRun.describe())   # Summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   kilometers  3 non-null      int64
 1   duration    3 non-null      int64
dtypes: int64(2)
memory usage: 176.0 bytes
None
       kilometers  duration
count         3.0       3.0
mean          4.0      45.0
std           1.0       5.0
min           3.0      40.0
25%           3.5      42.5
50%           4.0      45.0
75%           4.5      47.5
max           5.0      50.0


In [165]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


**Loading Data from a File**

In [166]:
mydf = pd.read_csv('https://raw.githubusercontent.com/gagan-iitb/DataAnalyticsAndVisualization/refs/heads/main/Lab-W25/dataset/names.csv')

In [167]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Download CSV - [names.csv](https://github.com/gagan-iitb/DataAnalyticsAndVisualization/blob/main/Lab-W25/dataset/names.csv)

In [168]:
print(mydf.head())  # Display the first 5 rows

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    James   23
4     John   26


In [169]:
print(mydf.head(7))  # Display the first 7 rows

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    James   23
4     John   26
5  William   28
6    Caleb   25


In [170]:
print(mydf['Name'])  # Single column

0      Alice
1        Bob
2    Charlie
3      James
4       John
5    William
6      Caleb
7      Helen
Name: Name, dtype: object


In [171]:
print(mydf[['Age', 'Name']])  # Multiple columns

   Age     Name
0   25    Alice
1   30      Bob
2   35  Charlie
3   23    James
4   26     John
5   28  William
6   25    Caleb
7   30    Helen


**Filtering Rows**

In [172]:
print(mydf[mydf['Age'] > 25])

      Name  Age
1      Bob   30
2  Charlie   35
4     John   26
5  William   28
7    Helen   30
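
The expression inside the brackets is a boolean Series, so several conditions can be combined. The sketch below uses a hypothetical four-row frame `people` (values taken from the first rows shown above) rather than the downloaded file:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "James"],
                       "Age": [25, 30, 35, 23]})

# Wrap each comparison in parentheses, then combine with | (or) or & (and)
young_or_old = people[(people["Age"] < 25) | (people["Age"] > 30)]

# isin() keeps rows whose value appears in the given list
named = people[people["Name"].isin(["Alice", "Bob"])]

print(young_or_old)
print(named)
```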


**Adding/Updating Columns**

In [173]:
mydf['Salary'] = [50000, 60000, 50000, 50000, 30000, 70000, 90000, 80000]
print(mydf)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   50000
3    James   23   50000
4     John   26   30000
5  William   28   70000
6    Caleb   25   90000
7    Helen   30   80000


**Saving to a File**

In [174]:
mydf.to_csv('myStudentDataFrame.csv', index=False)
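
The `index=False` argument matters when the file is read back. As a sketch on a hypothetical two-row frame, writing to an in-memory string instead of a file, the default setting stores the row index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`:

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

with_index = df.to_csv()                # no path given, so the CSV text is returned
without_index = df.to_csv(index=False)  # row index left out

cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Name', 'Age']
print(cols_without)  # ['Name', 'Age']
```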

**Dropping Columns**

In [175]:
mydf = mydf.drop('Salary', axis=1)  # Drop column

In [176]:
print(mydf)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    James   23
4     John   26
5  William   28
6    Caleb   25
7    Helen   30
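
An equivalent spelling is the `columns` keyword, which avoids having to remember the axis number. The sketch below, on a hypothetical frame `people`, also shows that `drop` returns a new DataFrame and leaves the original unchanged unless the result is assigned back:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob"],
                       "Age": [25, 30],
                       "Salary": [50000, 60000]})

# Same effect as people.drop('Salary', axis=1)
trimmed = people.drop(columns=["Salary"])

print(people.columns.tolist())   # original still has Salary
print(trimmed.columns.tolist())
```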


In [177]:
#Create/Append two new columns named Marks, Department in mydf and display it
mydf['Marks']=[90,91,78,67,100,34,76,46]
mydf['Department']=['CSE','DSAI','DSAI','ME','EE','MT','ME','MSME']
print(mydf)

      Name  Age  Marks Department
0    Alice   25     90        CSE
1      Bob   30     91       DSAI
2  Charlie   35     78       DSAI
3    James   23     67         ME
4     John   26    100         EE
5  William   28     34         MT
6    Caleb   25     76         ME
7    Helen   30     46       MSME


In [178]:
#Save the newly created mydf to a csv file. (Name of file = myDataframe_YourIDNumber.csv)
mydf.to_csv('myDataframe_12341680.csv', index=False)  # index=False leaves the row index out of the file

In [179]:
#Filter all the rows where Age falls between 25-30.
print(mydf[(mydf['Age']>=25) & (mydf['Age']<=30)])

      Name  Age  Marks Department
0    Alice   25     90        CSE
1      Bob   30     91       DSAI
4     John   26    100         EE
5  William   28     34         MT
6    Caleb   25     76         ME
7    Helen   30     46       MSME


**The unique() Function**

In [180]:
mydf.Age.unique()

array([25, 30, 35, 23, 26, 28])
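
Two related methods are often useful alongside `unique()`. As a sketch, using the eight ages shown earlier in a standalone Series named `ages` (an illustrative name), `nunique()` counts the distinct values and `value_counts()` reports how often each one occurs:

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 23, 26, 28, 25, 30], name="Age")

distinct = ages.nunique()      # number of distinct values
counts = ages.value_counts()   # occurrences of each value, most frequent first

print(distinct)
print(counts)
```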

**Sorting**

In [181]:
mydf.sort_values(by=['Age'])

      Name  Age  Marks Department
3    James   23     67         ME
0    Alice   25     90        CSE
6    Caleb   25     76         ME
4     John   26    100         EE
5  William   28     34         MT
1      Bob   30     91       DSAI
7    Helen   30     46       MSME
2  Charlie   35     78       DSAI


In [182]:
#Sort mydf dataframe on the basis of Name,Marks.
mydf.sort_values(by=['Name','Marks'])

      Name  Age  Marks Department
0    Alice   25     90        CSE
1      Bob   30     91       DSAI
6    Caleb   25     76         ME
2  Charlie   35     78       DSAI
7    Helen   30     46       MSME
3    James   23     67         ME
4     John   26    100         EE
5  William   28     34         MT
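
Sorting is ascending by default. The sketch below, on a hypothetical four-row frame `students` built from values shown above, uses the `ascending` argument to reverse the order and to mix directions across several keys:

```python
import pandas as pd

students = pd.DataFrame({"Name": ["Alice", "Bob", "Caleb", "Helen"],
                         "Age": [25, 30, 25, 30],
                         "Marks": [90, 91, 76, 46]})

# Oldest first
oldest_first = students.sort_values(by="Age", ascending=False)

# Age ascending, then Marks descending within each age
by_age_then_marks = students.sort_values(by=["Age", "Marks"],
                                         ascending=[True, False])

print(oldest_first)
print(by_age_then_marks)
```

Like `drop`, `sort_values` returns a new DataFrame, so `students` itself keeps its original row order.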


**Missing Data**

In [183]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Hank"],
    "Gender": ["Female", "Male", None, "Female", None, "Male", "Female", None],
}
df = pd.DataFrame(data)
print(df)

      Name  Gender
0    Alice  Female
1      Bob    Male
2  Charlie    None
3    Diana  Female
4      Eve    None
5    Frank    Male
6    Grace  Female
7     Hank    None


In [184]:
print("\nCheck for missing values:")
print(pd.isnull(df))


Check for missing values:
    Name  Gender
0  False   False
1  False   False
2  False    True
3  False   False
4  False    True
5  False   False
6  False   False
7  False    True


In [185]:
print("\nCheck for missing values(Column):")
print(pd.isnull(df['Gender']))


Check for missing values(Column):
0    False
1    False
2     True
3    False
4     True
5    False
6    False
7     True
Name: Gender, dtype: bool
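
A table of True/False values is hard to scan on larger data. As a sketch on the same Name/Gender data (rebuilt here under the illustrative name `people`), summing the boolean mask gives the number of missing values per column:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Hank"],
    "Gender": ["Female", "Male", None, "Female", None, "Male", "Female", None],
})

# True counts as 1 and False as 0, so the sum is the count of missing entries
missing_per_column = people.isnull().sum()

print(missing_per_column)
```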


In [186]:
# Fill missing values in the 'Gender' column with a default value

df['Gender'] = df['Gender'].fillna("Not Specified")

In [187]:
#updated dataframe
print(df)

      Name         Gender
0    Alice         Female
1      Bob           Male
2  Charlie  Not Specified
3    Diana         Female
4      Eve  Not Specified
5    Frank           Male
6    Grace         Female
7     Hank  Not Specified


In [188]:
#Read myStudentDataFrame.csv
mydf=pd.read_csv('https://raw.githubusercontent.com/gagan-iitb/DataAnalyticsAndVisualization/refs/heads/main/Lab-W25/dataset/myStudentDataFrame.csv')
print(mydf)

           name  class  mark  gender
0      John Deo   Four  75.0  female
1      Max Ruin  Three  85.0    male
2        Arnold  Three  55.0    male
3    Krish Star   Four  60.0  female
4     John Mike   Four  60.0  female
5     Alex John   Four  55.0    male
6   My John Rob   Five  78.0    male
7        Asruid   Five  85.0    male
8       Tes Qry    Six  78.0    male
9      Big John   Four  55.0  female
10       Ronald    Six  89.0  female
11        Recky    Six  94.0     NaN
12          Kty    NaN  88.0  female
13         Bigy  Seven  88.0  female
14     Tade Row   Four  88.0    male
15        Gimmy   Four  88.0     NaN
16        Tumyu    NaN  54.0    male
17        Honny   Five   NaN    male
18        Tinny   Nine  18.0    male
19       Jackly   Nine  65.0  female
20   Babby John   Four  69.0  female
21       Reggid  Seven  55.0  female
22        Herod  Eight  79.0    male
23    Tiddy Now  Seven  78.0     NaN
24     Giff Tow  Seven  88.0    male
25       Crelea  Seven  79.0    male
...

In [189]:
#Check for missing data in all columns using appropriate pandas functions.
print(pd.isnull(mydf))

     name  class   mark  gender
0   False  False  False   False
1   False  False  False   False
2   False  False  False   False
3   False  False  False   False
4   False  False  False   False
5   False  False  False   False
6   False  False  False   False
7   False  False  False   False
8   False  False  False   False
9   False  False  False   False
10  False  False  False   False
11  False  False  False    True
12  False   True  False   False
13  False  False  False   False
14  False  False  False   False
15  False  False  False    True
16  False   True  False   False
17  False  False   True   False
18  False  False  False   False
19  False  False  False   False
20  False  False  False   False
21  False  False  False   False
22  False  False  False   False
23  False  False  False    True
24  False  False  False   False
25  False  False  False   False
26  False  False  False   False
27  False  False  False   False
28  False  False  False   False
29  False  False   True   False
...

In [190]:
#Drop Rows with Missing Data
mydf=mydf.dropna()
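
Called without arguments, `dropna()` removes every row that has a missing value in any column. If only some columns matter, the `subset` argument narrows the check. The sketch below uses a hypothetical four-row sample with the same column names as the loaded file:

```python
import pandas as pd

# Hypothetical sample; column names match the loaded file
sample = pd.DataFrame({
    "name": ["John Deo", "Recky", "Kty", "Honny"],
    "class": ["Four", "Six", None, "Five"],
    "mark": [75.0, 94.0, 88.0, None],
    "gender": ["female", None, "female", "male"],
})

complete_rows = sample.dropna()             # drops any row with a missing value
has_mark = sample.dropna(subset=["mark"])   # drops only rows missing a mark

print(len(complete_rows), len(has_mark))
```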

In [191]:
#Compute Summary Statistics (AVG,MEAN,MAX,MIN)
print(mydf.describe())

            mark
count  28.000000
mean   73.464286
std    17.125857
min    18.000000
25%    60.000000
50%    78.500000
75%    88.000000
max    96.000000
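
`describe()` reports all of these at once. When only a few statistics are needed, they can be requested individually. A minimal sketch, using a hypothetical Series of four marks rather than the full file:

```python
import pandas as pd

marks = pd.Series([75.0, 85.0, 55.0, 60.0], name="mark")

print(marks.mean())   # average
print(marks.min())
print(marks.max())

# agg() collects several statistics into one Series
summary = marks.agg(["mean", "min", "max"])
print(summary)
```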


In [192]:
#Filter Data and Compute Pass/Fail
#mark >= 40: Pass
#mark < 40: Fail
mydf['Result'] = mydf['mark'].apply(lambda x: 'Pass' if x >= 40 else 'Fail')
print(mydf)

           name  class  mark  gender Result
0      John Deo   Four  75.0  female   Pass
1      Max Ruin  Three  85.0    male   Pass
2        Arnold  Three  55.0    male   Pass
3    Krish Star   Four  60.0  female   Pass
4     John Mike   Four  60.0  female   Pass
5     Alex John   Four  55.0    male   Pass
6   My John Rob   Five  78.0    male   Pass
7        Asruid   Five  85.0    male   Pass
8       Tes Qry    Six  78.0    male   Pass
9      Big John   Four  55.0  female   Pass
10       Ronald    Six  89.0  female   Pass
13         Bigy  Seven  88.0  female   Pass
14     Tade Row   Four  88.0    male   Pass
18        Tinny   Nine  18.0    male   Fail
19       Jackly   Nine  65.0  female   Pass
20   Babby John   Four  69.0  female   Pass
21       Reggid  Seven  55.0  female   Pass
22        Herod  Eight  79.0    male   Pass
24     Giff Tow  Seven  88.0    male   Pass
25       Crelea  Seven  79.0    male   Pass
26     Big Nose  Three  81.0  female   Pass
27    Rojj Base  Seven  86.0  female   Pass
...
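
The `apply` call above runs the lambda once per row. A common alternative, shown here only as a sketch on a hypothetical Series of four marks, is NumPy's `where`, which evaluates the condition on the whole column at once:

```python
import numpy as np
import pandas as pd

marks = pd.Series([75.0, 18.0, 40.0, 39.5], name="mark")

# np.where(condition, value_if_true, value_if_false), applied element-wise
result = np.where(marks >= 40, "Pass", "Fail")

print(result.tolist())
```

Both approaches give the same labels; the vectorized form tends to be faster on large columns.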

In [193]:
#Add a new column Result to the DataFrame indicating Pass or Fail.


In [194]:
#Save the Final DataFrame
#Save the updated DataFrame (with the Result column) to a new CSV file named Result_YourIDNumber.csv.
mydf.to_csv('Result_12341680.csv')

Additional Practice Questions - [Click Here](https://colab.research.google.com/drive/1_Hc9yV2RIgvau6BsLiYXn87orXqa52RX?usp=sharing)