<a href="https://colab.research.google.com/github/Nikulkumar-Dabhi/Data_Science/blob/main/pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Pandas** is a powerful, widely used Python library for data analysis and manipulation. It provides two core data structures, Series and DataFrame, that make it easy to work with structured data such as tables and time series.

1. Installation

In [1]:
pip install pandas



2. Importing Pandas

In [2]:
import pandas as pd

3. Data Structures: Series and DataFrames

Series: A one-dimensional labeled array capable of holding any data type. Think of it like a column in a table.

In [3]:

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
4    50
dtype: int64
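
The "labeled" part of the definition is easier to see when the labels are not just the default 0, 1, 2, and so on. The following is a minimal sketch; the labels `'a'`, `'b'`, `'c'` and the variable name `labeled` are made up for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical labels chosen for illustration only
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Look up a value by its label rather than by its position
print(labeled['b'])

# The index holds the labels that were passed in
print(list(labeled.index))
```

When no `index` is given, as in the cell above, pandas falls back to a default integer index.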


DataFrame: A two-dimensional table-like structure with rows and columns. It's the most commonly used data structure in Pandas.

In [4]:
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris


Generating Sample CSV

In [8]:
import csv
import random

# Define the number of students and courses
num_students = 100
num_courses = 6

# Create a list to store the data
data = []

# Generate data for each student
for i in range(num_students):
    roll_number = i + 1  # Assign roll numbers sequentially
    scores = [random.randint(0, 100) for _ in range(num_courses)]  # Generate random scores
    data.append([roll_number] + scores)  # Add roll number and scores to the data list

# Define the header row
header = ['Roll Number'] + [f'Course {j}' for j in range(1, num_courses + 1)]

# Write the data to a CSV file
with open('student_scores.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)
    writer.writerows(data)

4. Reading and Writing Data

Pandas can read data from, and write data to, a variety of file formats, including CSV and Excel.

In [9]:
# Reading a CSV file into a DataFrame
df = pd.read_csv('student_scores.csv')  # The file generated in the previous cell; use your own path for other data
print(df)

    Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
0             1        34        24        23         5        62        37
1             2        45        72        20        77        78        10
2             3        37        75        26        12        84        17
3             4        65        66        74        26        38        65
4             5        42        83        85        76        13        79
..          ...       ...       ...       ...       ...       ...       ...
95           96        37        62        95        96        31        77
96           97        46        69        83        23        49        69
97           98        38        89        45        27        98        23
98           99        63        56        63        68        31        93
99          100        94        96         4        38        12        88

[100 rows x 7 columns]
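
The cell above covers reading only. Writing goes through `DataFrame.to_csv`. The sketch below uses a small stand-in table and a hypothetical file name (`scores_copy.csv`) inside a temporary directory, so it does not overwrite the file generated earlier.

```python
import os
import tempfile

import pandas as pd

# Small stand-in table; the column names mirror student_scores.csv
scores = pd.DataFrame({'Roll Number': [1, 2, 3],
                       'Course 1': [34, 45, 37]})

# Hypothetical output path inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), 'scores_copy.csv')

# index=False leaves out the 0, 1, 2, ... row labels, so they do not
# come back as an extra unnamed column when the file is read again
scores.to_csv(out_path, index=False)

# Read the file back to check that the round trip preserved the data
round_trip = pd.read_csv(out_path)
print(round_trip)
```

Excel files follow the same pattern with `read_excel` and `to_excel`, although those typically need an additional engine package such as `openpyxl` to be installed.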


5. Basic Data Exploration

head() and tail(): Display the first or last few rows of the DataFrame.

In [11]:
# Display the first row (head() returns 5 rows when called without an argument)
print(df.head(1))

# Display the last 3 rows
print(df.tail(3))

   Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
0            1        34        24        23         5        62        37
    Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
97           98        38        89        45        27        98        23
98           99        63        56        63        68        31        93
99          100        94        96         4        38        12        88


info(): Get a summary of the DataFrame, including each column's data type and non-null count (a non-null count below the number of rows indicates missing values).

In [13]:
df.info()  # info() prints the summary itself and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Roll Number  100 non-null    int64
 1   Course 1     100 non-null    int64
 2   Course 2     100 non-null    int64
 3   Course 3     100 non-null    int64
 4   Course 4     100 non-null    int64
 5   Course 5     100 non-null    int64
 6   Course 6     100 non-null    int64
dtypes: int64(7)
memory usage: 5.6 KB
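
The generated scores file has no gaps, so the summary above shows 100 non-null values in every column. To count missing values directly, one option is `isna().sum()`. The sketch below uses a hypothetical three-row frame with one score deliberately left out.

```python
import pandas as pd

# Hypothetical frame with one missing score, for illustration only
sample = pd.DataFrame({'Roll Number': [1, 2, 3],
                       'Course 1': [34, None, 37]})

# isna() marks missing cells as True; sum() then counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```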


describe(): Get descriptive statistics of numerical columns.

In [16]:
print(df.describe())

       Roll Number   Course 1    Course 2    Course 3    Course 4    Course 5  \
count   100.000000  100.00000  100.000000  100.000000  100.000000  100.000000   
mean     50.500000   55.11000   49.540000   48.670000   44.570000   49.930000   
std      29.011492   28.37551   30.005191   30.712763   29.669647   28.132534   
min       1.000000    1.00000    0.000000    0.000000    0.000000    0.000000   
25%      25.750000   29.00000   24.750000   22.750000   17.500000   29.000000   
50%      50.500000   55.50000   50.500000   45.500000   39.500000   45.000000   
75%      75.250000   81.00000   75.000000   78.250000   69.000000   75.000000   
max     100.000000  100.00000  100.000000  100.000000  100.000000  100.000000   

         Course 6  
count  100.000000  
mean    49.940000  
std     29.914633  
min      1.000000  
25%     23.000000  
50%     47.000000  
75%     77.000000  
max     99.000000  


6. Data Selection and Manipulation

Pandas offers various ways to select and manipulate data within a DataFrame:

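Selecting Columns: Access columns by name using square brackets. Dot notation (for example `df.Name`) also works, but only when the column name is a valid Python identifier, so a name containing a space, such as 'Roll Number', requires brackets.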

In [21]:
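# Selecting a single column (returns a Series)
roll_numbers = df['Roll Number']  # dot notation is not possible here because the name contains a space
print(roll_numbers)

# Selecting multiple columns (pass a list of names; returns a DataFrame)
course_scores = df[['Course 1', 'Course 2']]
print(course_scores)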

0       1
1       2
2       3
3       4
4       5
     ... 
95     96
96     97
97     98
98     99
99    100
Name: Roll Number, Length: 100, dtype: int64
    Course 1  Course 2
0         34        24
1         45        72
2         37        75
3         65        66
4         42        83
..       ...       ...
95        37        62
96        46        69
97        38        89
98        63        56
99        94        96

[100 rows x 2 columns]
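
To show the two access styles side by side, here is a short sketch on a stand-in frame. The `Total` column is hypothetical and was added only because its name has no space in it.

```python
import pandas as pd

# Stand-in frame; 'Total' is a hypothetical column added for illustration
sample = pd.DataFrame({'Roll Number': [1, 2, 3],
                       'Total': [250, 310, 275]})

# Dot notation works because 'Total' is a valid Python identifier
print(sample.Total.tolist())

# 'Roll Number' contains a space, so only brackets can reach it
print(sample['Roll Number'].tolist())
```

Bracket access works for every column name, which makes it the safer default.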


Filtering Rows: Use boolean indexing to filter rows based on conditions.

In [23]:
# Filter rows where Roll Number is greater than 28
filtered_df = df[df['Roll Number'] > 28]
print(filtered_df)

    Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
28           29        77        14        74        21        20        41
29           30        93        86        13        68        80        79
30           31        86        31        72        30        74        73
31           32        94         3        16        98        27        69
32           33        36         8        12        20        10         2
..          ...       ...       ...       ...       ...       ...       ...
95           96        37        62        95        96        31        77
96           97        46        69        83        23        49        69
97           98        38        89        45        27        98        23
98           99        63        56        63        68        31        93
99          100        94        96         4        38        12        88

[72 rows x 7 columns]
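
Boolean indexing can also combine several conditions. The sketch below, on a small stand-in frame with made-up scores, keeps the students who scored at least 60 in both courses; the threshold of 60 is an arbitrary choice for illustration.

```python
import pandas as pd

# Stand-in frame with made-up scores
sample = pd.DataFrame({'Roll Number': [1, 2, 3, 4],
                       'Course 1': [34, 45, 90, 65],
                       'Course 2': [24, 72, 80, 30]})

# Each condition needs its own parentheses; & means "and", | means "or"
both_high = sample[(sample['Course 1'] >= 60) & (sample['Course 2'] >= 60)]
print(both_high)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used here.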


Sorting: Sort the DataFrame by one or more columns.




In [25]:
# Sort by Course 3 in ascending order (the default)
sorted_df = df.sort_values(by=['Course 3'])
print(sorted_df)

    Roll Number  Course 1  Course 2  Course 3  Course 4  Course 5  Course 6
11           12        11        40         0        19        18        96
57           58         7        49         1        23        30        38
68           69        92        32         1         7        19        24
67           68         9        89         3        40        87         7
99          100        94        96         4        38        12        88
..          ...       ...       ...       ...       ...       ...       ...
95           96        37        62        95        96        31        77
61           62         8         9        97         5        36        53
48           49        87        76        98        38        78        21
56           57        84        94       100        32        93         8
20           21       100         5       100        14        12        49

[100 rows x 7 columns]
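
The example above sorts by a single column. As a sketch of the "one or more columns" case, the following sorts a small stand-in frame (made-up scores) by `Course 3` from highest to lowest and breaks ties using `Course 1` from lowest to highest.

```python
import pandas as pd

# Stand-in frame with made-up scores, including ties in 'Course 3'
sample = pd.DataFrame({'Roll Number': [1, 2, 3, 4],
                       'Course 3': [74, 100, 74, 100],
                       'Course 1': [65, 84, 20, 100]})

# Highest Course 3 first; ties are broken by Course 1, lowest first
ranked = sample.sort_values(by=['Course 3', 'Course 1'],
                            ascending=[False, True])
print(ranked)
```

`sort_values` returns a new DataFrame and leaves the original unchanged, so the result has to be assigned to a variable to be kept.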
