
**Pandas** is a widely used Python library for data analysis and manipulation. It provides two core data structures, Series and DataFrames, that make it easy to work with structured data such as tables and time series.

1. Installation

In [2]:
pip install pandas



2. Importing Pandas

In [3]:
import pandas as pd

3. Data Structures: Series and DataFrames

Series: A one-dimensional labeled array capable of holding any data type. Think of it like a column in a table.

In [4]:

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
4    50
dtype: int64
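The "labeled" part of the definition is easier to see with a custom index. The sketch below is illustrative only: the subject names and values are hypothetical and do not come from the notebook's data.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
scores = pd.Series([85, 92, 78], index=['math', 'physics', 'chemistry'])

# .loc looks a value up by its label, .iloc by its integer position
physics_score = scores.loc['physics']
first_score = scores.iloc[0]
print(physics_score, first_score)
```

When no index is given, as in the cell above, pandas falls back to the default labels 0, 1, 2, and so on.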


DataFrame: A two-dimensional table-like structure with rows and columns. It's the most commonly used data structure in Pandas.

In [5]:
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris
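A quick way to connect the two structures: each column of a DataFrame is itself a Series. The sketch below rebuilds the same small table under a new, hypothetical variable name (`df_people`) and inspects its shape and columns.

```python
import pandas as pd

# Same data as the cell above, stored under a hypothetical name
df_people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                          'Age': [25, 30, 28],
                          'City': ['New York', 'London', 'Paris']})

n_rows, n_cols = df_people.shape         # (number of rows, number of columns)
column_names = list(df_people.columns)   # column labels in order
ages = df_people['Age']                  # a single column comes back as a Series
print(n_rows, n_cols, column_names, type(ages).__name__)
```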


Generating Sample CSV

In [6]:
import csv
import random

# Define the number of students and courses
num_students = 100
num_courses = 6

# Create a list to store the data
data = []

# Generate data for each student
for i in range(num_students):
    roll_number = i + 1  # Assign roll numbers sequentially
    scores = [random.randint(0, 100) for _ in range(num_courses)]  # Generate random scores
    data.append([roll_number] + scores)  # Add roll number and scores to the data list

# Define the header row
header = ['Roll Number'] + [f'Course {j}' for j in range(1, num_courses + 1)]

# Write the data to a CSV file
with open('student_scores.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)
    writer.writerows(data)

4. Reading and Writing Data

Pandas can read data from, and write data to, a variety of file formats such as CSV and Excel.

In [7]:
# Reading a CSV file into a DataFrame
df = pd.read_csv('student_scores.csv')  # Replace 'student_scores.csv' with the path to your own file
print(df)

    Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
0             1        25        48        77        15        28        84
1             2        23        97        14         5        15        40
2             3        86         7        78        11        25        67
3             4        48        43        45         0         2         2
4             5        99         9        34        96        46         9
..          ...       ...       ...       ...       ...       ...       ...
95           96        29        60        75         6        56         2
96           97        44        91        90        28        62        29
97           98        16        20        16        32        21        83
98           99        62        93        55        33        72        32
99          100         2        78        17        11        25         1

[100 rows x 7 columns]
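The cell above covers reading only. Writing works the same way in reverse with `to_csv`. The following is a minimal sketch using a small hypothetical frame and a temporary directory, so it does not overwrite `student_scores.csv`; the file name `scores_copy.csv` is an assumption made for the example.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame, used only to demonstrate the round trip
small_df = pd.DataFrame({'Roll Number': [1, 2, 3], 'Course 1': [25, 23, 86]})

# Write into a temporary directory so no existing file is touched
out_path = os.path.join(tempfile.mkdtemp(), 'scores_copy.csv')
small_df.to_csv(out_path, index=False)  # index=False keeps the row index out of the file

# Read the file back and compare it with the original
round_trip = pd.read_csv(out_path)
print(round_trip.equals(small_df))
```

If `index=False` is left out, the row index is written as an extra unnamed column, which `read_csv` then loads as a column called `Unnamed: 0`.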


5. Basic Data Exploration

head() and tail(): Display the first or last few rows of the DataFrame.

In [8]:
# Display the first row (head() returns 5 rows when called without an argument)
print(df.head(1))

# Display the last 3 rows
print(df.tail(3))

   Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
0            1        25        48        77        15        28        84
    Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
97           98        16        20        16        32        21        83
98           99        62        93        55        33        72        32
99          100         2        78        17        11        25         1


info(): Get a summary of the DataFrame, including each column's data type, its non-null count, and the memory usage.

In [9]:
print(df.info())  # info() prints the summary itself; print() then shows its return value, None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Roll Number  100 non-null    int64
 1   Course 1     100 non-null    int64
 2   Course 2     100 non-null    int64
 3   Course 3     100 non-null    int64
 4   Course 4     100 non-null    int64
 5   Course 5     100 non-null    int64
 6   Course 6     100 non-null    int64
dtypes: int64(7)
memory usage: 5.6 KB
None


describe(): Get descriptive statistics of numerical columns.

In [10]:
print(df.describe())

       Roll Number    Course 1    Course 2    Course 3    Course 4  \
count   100.000000  100.000000  100.000000  100.000000  100.000000   
mean     50.500000   44.810000   47.660000   49.980000   46.080000   
std      29.011492   28.628126   31.325021   30.422606   31.049221   
min       1.000000    2.000000    0.000000    0.000000    0.000000   
25%      25.750000   21.750000   22.500000   24.000000   18.750000   
50%      50.500000   41.500000   44.500000   46.000000   42.000000   
75%      75.250000   67.000000   78.250000   77.250000   74.500000   
max     100.000000   99.000000  100.000000  100.000000  100.000000   

         Course 5    Course 6  
count  100.000000  100.000000  
mean    50.060000   47.330000  
std     28.953749   26.296697  
min      2.000000    1.000000  
25%     25.750000   26.750000  
50%     46.500000   49.000000  
75%     77.500000   66.000000  
max    100.000000   99.000000  
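Each row of the `describe()` table can also be computed on its own, which helps when only one statistic is needed. A small sketch on hypothetical scores (not the notebook's random data):

```python
import pandas as pd

# Hypothetical scores, small enough to check by hand
sample = pd.DataFrame({'Course 1': [10, 20, 30, 40],
                       'Course 2': [50, 60, 70, 80]})

summary = sample.describe()             # full table of statistics
mean_c1 = sample['Course 1'].mean()     # same value as summary.loc['mean', 'Course 1']
median_c1 = sample['Course 1'].median() # same value as the '50%' row
print(mean_c1, median_c1)
```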


6. Data Selection and Manipulation

Pandas offers various ways to select and manipulate data within a DataFrame:

Selecting Columns: Access columns by name using square brackets or dot notation.

In [11]:
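# Selecting a single column
# Dot notation (df.column) only works when the name is a valid Python identifier,
# so a name containing a space, like 'Roll Number', needs square brackets
roll_numbers = df['Roll Number']
print(roll_numbers)

# Selecting multiple columns
course_scores = df[['Course 1', 'Course 2']]
print(course_scores)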

0       1
1       2
2       3
3       4
4       5
     ... 
95     96
96     97
97     98
98     99
99    100
Name: Roll Number, Length: 100, dtype: int64
    Course 1  Course 2
0         25        48
1         23        97
2         86         7
3         48        43
4         99         9
..       ...       ...
95        29        60
96        44        91
97        16        20
98        62        93
99         2        78

[100 rows x 2 columns]
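To make the difference between the two notations concrete, here is a short sketch on a hypothetical two-column frame. The column `Age` is a valid Python identifier, so both notations work for it; `Roll Number` contains a space, so only square brackets can be written for it.

```python
import pandas as pd

# Hypothetical frame with one identifier-like name and one name containing a space
sample = pd.DataFrame({'Roll Number': [1, 2], 'Age': [25, 30]})

ages_dot = sample.Age               # dot notation works for 'Age'
ages_bracket = sample['Age']        # square brackets always work
roll_numbers = sample['Roll Number']  # sample.Roll Number would be a syntax error

print(ages_dot.equals(ages_bracket))
```

Square brackets are the safer default, since dot notation can also clash with existing DataFrame attributes and methods.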


Filtering Rows: Use boolean indexing to filter rows based on conditions.

In [12]:
# Filter rows where Roll Number is greater than 28
filtered_df = df[df['Roll Number'] > 28]
print(filtered_df)

    Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
28           29        14        61         8        98        36        57
29           30        47        30        43        32        57        27
30           31        18        41        10        39        41        66
31           32        41         6        24        11        37        91
32           33        51        61        89        92        79        74
..          ...       ...       ...       ...       ...       ...       ...
95           96        29        60        75         6        56         2
96           97        44        91        90        28        62        29
97           98        16        20        16        32        21        83
98           99        62        93        55        33        72        32
99          100         2        78        17        11        25         1

[72 rows x 7 columns]
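Conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The sketch below uses hypothetical scores so the results can be checked by hand.

```python
import pandas as pd

# Hypothetical scores for four students
sample = pd.DataFrame({'Roll Number': [1, 2, 3, 4],
                       'Course 1': [35, 80, 55, 90],
                       'Course 2': [70, 40, 65, 95]})

# Rows where BOTH conditions hold
both_high = sample[(sample['Course 1'] > 50) & (sample['Course 2'] > 60)]

# Rows where AT LEAST ONE condition holds
either_low = sample[(sample['Course 1'] < 40) | (sample['Course 2'] < 50)]

print(both_high)
print(either_low)
```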


Sorting: Sort the DataFrame by one or more columns.




In [13]:
# Sort by Course 3 in ascending order
sorted_df = df.sort_values(by=['Course 3'])
print(sorted_df)

    Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
55           56        76        24         0        26        41        41
72           73        30        87         0         5        35        75
88           89        64        93         2         4        17        44
34           35         3        99         5        74         8        20
15           16        53        24         6        44        80        35
..          ...       ...       ...       ...       ...       ...       ...
21           22        21        79        96        68        85        50
58           59        12        56        97        97        45        52
45           46        33         5        97        45        21        32
83           84        25        74        97        31        84        35
76           77        23        27       100        81        56        70

[100 rows x 7 columns]
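`sort_values` also accepts an `ascending` argument and a list of columns, which covers the "one or more columns" case. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical scores with a tie in 'Course 3'
sample = pd.DataFrame({'Roll Number': [1, 2, 3, 4],
                       'Course 3': [70, 45, 70, 90],
                       'Course 4': [10, 60, 30, 20]})

# Highest 'Course 3' score first
desc = sample.sort_values(by='Course 3', ascending=False)

# Sort by 'Course 3' descending, breaking ties by 'Course 4' ascending
multi = sample.sort_values(by=['Course 3', 'Course 4'], ascending=[False, True])

print(desc)
print(multi)
```

Both calls return a new DataFrame; `sample` keeps its original row order.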


7. Data Cleaning and Processing

Identifying Missing Values

In [14]:
df.isnull().sum()  # Returns the number of missing values in each column

Roll Number    0
Course 1       0
Course 2       0
Course 3       0
Course 4       0
Course 5       0
Course 6       0
dtype: int64
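The generated CSV has no missing values, so every count above is zero. To see non-zero counts, the sketch below builds a small hypothetical frame with a few `NaN` entries.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately missing scores
sample = pd.DataFrame({'Roll Number': [1, 2, 3],
                       'Course 1': [25, np.nan, 86],
                       'Course 2': [np.nan, np.nan, 7]})

missing_per_column = sample.isnull().sum()      # count of NaN values in each column
total_missing = int(missing_per_column.sum())   # count across the whole frame

print(missing_per_column)
print(total_missing)
```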


Removing Missing Values

In [16]:
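# dropna() returns a new DataFrame; df is unchanged unless the result is assigned
df.dropna()  # Drops rows with any missing values
df.dropna(axis=1)  # Drops columns with any missing values (only this last result is displayed)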

    Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
0             1        25        48        77        15        28        84
1             2        23        97        14         5        15        40
2             3        86         7        78        11        25        67
3             4        48        43        45         0         2         2
4             5        99         9        34        96        46         9
..          ...       ...       ...       ...       ...       ...       ...
95           96        29        60        75         6        56         2
96           97        44        91        90        28        62        29
97           98        16        20        16        32        21        83
98           99        62        93        55        33        72        32
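Because the student data has nothing missing, `dropna()` returns it unchanged. The effect is clearer on a small hypothetical frame that does contain `NaN` values. The `subset` argument shown last restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately missing scores
sample = pd.DataFrame({'Roll Number': [1, 2, 3],
                       'Course 1': [25, np.nan, 86],
                       'Course 2': [np.nan, np.nan, 7]})

rows_kept = sample.dropna()                        # keep only rows with no NaN at all
cols_kept = sample.dropna(axis=1)                  # keep only columns with no NaN at all
course1_known = sample.dropna(subset=['Course 1']) # drop rows only where 'Course 1' is NaN

print(rows_kept)
print(cols_kept)
print(course1_known)
```

Each call returns a new DataFrame, so `sample` still has all three rows and columns afterwards.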


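Meaning of axis=0, axis=1, axis=2:
🟢 axis=0 → Operate down the rows (i.e., column-wise)
For aggregations such as sum() or mean(), this produces one result per column

Think: "go down each column"

🟦 axis=1 → Operate across columns (i.e., row-wise)
For aggregations such as sum() or mean(), this produces one result per row

Think: "go across the row"

Note: for methods that remove data, such as dropna() and drop(), axis names what gets removed: axis=0 removes rows, axis=1 removes columns

🟣 axis=2 → Used in 3D arrays, like tensors (NumPy only)
A DataFrame has only two axes, so axis=2 applies to NumPy arrays, where it runs along the depth or channels, often used in image processing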
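The axis rules can be checked on a tiny example. The sketch below uses a hypothetical 2x2 frame for axis=0 and axis=1, and a NumPy array of ones standing in for an image (height 2, width 3, 4 channels) for axis=2.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frame of scores
sample = pd.DataFrame({'Course 1': [10, 20], 'Course 2': [30, 40]})

col_totals = sample.sum(axis=0)  # go down each column: one total per column
row_totals = sample.sum(axis=1)  # go across each row: one total per row

# A 3D NumPy array standing in for an image: (height, width, channels)
image = np.ones((2, 3, 4))
channel_sum = image.sum(axis=2)  # collapses the channel axis, leaving (height, width)

print(col_totals)
print(row_totals)
print(channel_sum.shape)
```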




Imputing Missing Values