# Module33 Pandas Advance2 Assignment

Consider the following code when answering the questions below:

```
import pandas as pd
course_name = ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer']
duration = [2,3,6,4]
df = pd.DataFrame(data = {'course_name' : course_name, 'duration' : duration})
```



Q1. Write a code to print the data present in the second row of the dataframe, df.

In [None]:
import pandas as pd

course_name = ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer']
duration = [2, 3, 6, 4]

df = pd.DataFrame(data={'course_name': course_name, 'duration': duration})

# Printing the second row (index 1)
print(df.iloc[1])


course_name    Machine Learning
duration                      3
Name: 1, dtype: object


Q2. What is the difference between the functions loc and iloc in pandas.DataFrame?

A2. The differences between 'loc' and 'iloc' are:

loc (label-based indexing)

iloc (integer-position-based indexing)


## Features

**1. Indexing type**

a.) loc- Uses labels (row index names)

b.) iloc- Uses integer positions (0-based index)

**2. Selection**

a.) loc- ```df.loc[2]``` selects the row where index is 2

b.) iloc- ```df.iloc[2]``` selects the row at position 2 (third row)

**3. Slicing**

a.) loc- ```df.loc[1:3]``` includes the row with label 3 (both endpoints are included)

b.) iloc- ```df.iloc[1:3]``` excludes the row at position 3 (returns positions 1 and 2)

**4. Column Selection**

a.) loc- ```df.loc[:, 'course_name']``` selects 'course_name' column

b.) iloc- ```df.iloc[:, 0]``` selects the first column

**5. Error Handling**

a.) loc- Raises a KeyError if the label doesn’t exist

b.) iloc- Works as long as the position is within range; otherwise raises an IndexError


In [None]:
# Example

df.loc[1]  # Selects row where index = 1 (Machine Learning)
df.iloc[1]  # Selects the second row (Machine Learning)

course_name    Machine Learning
duration                      3
Name: 1, dtype: object
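
The slicing and error-handling rows of the comparison can be checked with a small sketch. It rebuilds the assignment's data under the hypothetical name `demo` so that it runs on its own; the label and position 10 are arbitrary values chosen because they do not exist.

```python
import pandas as pd

# Same data as the assignment's df, rebuilt so the block is self-contained.
demo = pd.DataFrame({
    "course_name": ["Data Science", "Machine Learning", "Big Data", "Data Engineer"],
    "duration": [2, 3, 6, 4],
})

# Label-based slice: both endpoints are included (labels 1, 2 and 3).
loc_slice = demo.loc[1:3]

# Position-based slice: the stop position is excluded (positions 1 and 2).
iloc_slice = demo.iloc[1:3]

print(len(loc_slice), len(iloc_slice))  # 3 2

# A label that is not in the index raises KeyError with loc.
try:
    demo.loc[10]
except KeyError as exc:
    loc_error = type(exc).__name__

# A position outside the valid range raises IndexError with iloc.
try:
    demo.iloc[10]
except IndexError as exc:
    iloc_error = type(exc).__name__

print(loc_error, iloc_error)  # KeyError IndexError
```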


Q3. Reindex the given dataframe using a variable, ```reindex = [3,0,1,2]``` and store it in ```new_df``` then find the output for both ```new_df.loc[2]``` and ```new_df.iloc[2]```. Did you observe any difference in both the outputs? If so then explain it.

A3. Reindexing Code:

In [None]:
reindex = [3, 0, 1, 2]
new_df = df.reindex(reindex)

print("Using loc[2]:\n", new_df.loc[2])  # Uses index label
print("\nUsing iloc[2]:\n", new_df.iloc[2])  # Uses integer position


Using loc[2]:
 course_name    Big Data
duration              6
Name: 2, dtype: object

Using iloc[2]:
 course_name    Machine Learning
duration                      3
Name: 1, dtype: object


### Explanation:

1. ```new_df.loc[2]``` selects row where index is 2 (original index of 'Big Data').

2. ```new_df.iloc[2]``` selects the row at position 2 in the new order (original index 1, 'Machine Learning').

3. **Observation:**

a.) loc looks rows up by index label, and each label stays attached to its row when the DataFrame is reordered.

b.) iloc ignores the index labels and works positionally.
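
As a follow-up sketch, the labels can be renumbered after reordering so that label and position line up again. The names `demo`, `reordered` and `renumbered` are illustrative only, and using `reset_index(drop=True)` is one possible way to do this, not part of the original answer.

```python
import pandas as pd

demo = pd.DataFrame({
    "course_name": ["Data Science", "Machine Learning", "Big Data", "Data Engineer"],
    "duration": [2, 3, 6, 4],
})

reordered = demo.reindex([3, 0, 1, 2])

# Labels travel with their rows, so label 2 is still 'Big Data'.
by_label = reordered.loc[2, "course_name"]

# The row at position 2 is now the one labelled 1.
by_position = reordered.iloc[2]["course_name"]

# Dropping the old labels renumbers the rows 0..3 in their new order,
# after which loc[2] and iloc[2] refer to the same row again.
renumbered = reordered.reset_index(drop=True)

print(by_label)                            # Big Data
print(by_position)                         # Machine Learning
print(renumbered.loc[2, "course_name"])    # Machine Learning
```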

Q4. Write a code to find the following statistical measurements for the above dataframe df1:

(i) mean of each and every column present in the dataframe.

(ii) standard deviation of column, ‘column_2’

A4.
# **Statistical Measurements in DataFrame df1**

(i) Finding the Mean of Each Column

In [None]:
import numpy as np

columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']
indices = [1, 2, 3, 4, 5, 6]

# Creating a dataframe with random values
df1 = pd.DataFrame(np.random.rand(6, 6), columns=columns, index=indices)

In [None]:
column_means = df1.mean()
print("Mean of each column:\n", column_means)


**Explanation:**

1.) ```df1.mean()``` calculates the mean of each column.

2.) By default, axis=0 (column-wise mean).
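
To make the axis behaviour concrete, here is a minimal sketch on a tiny hand-written frame (`small` is a hypothetical name, and fixed values are used instead of random ones so the results can be checked by hand).

```python
import pandas as pd

small = pd.DataFrame(
    {"column_1": [1.0, 2.0, 3.0], "column_2": [10.0, 20.0, 30.0]},
    index=[1, 2, 3],
)

col_means = small.mean()        # axis=0 (default): one value per column
row_means = small.mean(axis=1)  # axis=1: one value per row

print(col_means.tolist())  # [2.0, 20.0]
print(row_means.tolist())  # [5.5, 11.0, 16.5]
```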


(ii) Finding the Standard Deviation of ```column_2```

In [None]:
column2_std = df1['column_2'].std()
print("\nStandard deviation of column_2:", column2_std)



Standard deviation of column_2: 0.2805991083901862
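
One detail worth knowing here: `Series.std()` uses the sample formula (`ddof=1`) by default, while NumPy's `np.std` defaults to the population formula (`ddof=0`). The sketch below uses a small made-up series, `values`, to show the difference.

```python
import numpy as np
import pandas as pd

# Hand-picked values with mean 5 and population variance 4.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # pandas default: ddof=1
population_std = values.std(ddof=0)  # same formula as np.std's default

print(round(sample_std, 3))      # 2.138
print(round(population_std, 3))  # 2.0
print(bool(np.isclose(population_std, np.std(values.to_numpy()))))  # True
```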


Q5. Replace the data present in the second row of column, ‘column_2’ by a string variable then find the mean of column, column_2.

If you are getting errors in executing it then explain why.

[Hint: To replace the data use df1.loc[] and equate this to string data of your choice.]

A5. Step 1: Replacing the Second Row of column_2 with a String

In [None]:
df1.loc[2, 'column_2'] = "string_value"



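This replaces the value in the second row of 'column_2' with a string; the label 2 is the second row because df1's index runs from 1 to 6. The column previously held floats, so it is upcast to object dtype, and recent pandas 2.x releases emit a FutureWarning for this kind of assignment.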

Step 2: Trying to Calculate the Mean of column_2

In [None]:
mean_column2 = df1['column_2'].mean()
print("Mean of column_2:", mean_column2)


TypeError: unsupported operand type(s) for +: 'float' and 'str'

## What Will Happen?
🔥 Error!
The mean() function will raise an error because 'column_2' now contains a string, and mean can only be calculated on numeric values.

Expected Error Message
```
TypeError: unsupported operand type(s) for +: 'float' and 'str'
```

**Explanation:**

Pandas tries to compute the mean, but encounters a string value in 'column_2'.

Since arithmetic operations (+ and /) between numbers and strings are not allowed, it throws a TypeError.

# How to Fix the Error?

Solution 1: Convert the Column to Numeric (Coercing Invalid Values to NaN)

In [None]:
df1['column_2'] = pd.to_numeric(df1['column_2'], errors='coerce')
mean_column2 = df1['column_2'].mean()
print("Mean of column_2 after conversion:", mean_column2)


Mean of column_2 after conversion: 0.40822028785282444


🔹 pd.to_numeric(errors='coerce') converts non-numeric values into NaN, allowing the mean calculation to proceed.
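
The coercion step can be seen in isolation with a short sketch. The series `mixed` is a made-up stand-in for the modified column; `mean()` skips NaN by default, which is why the calculation goes through.

```python
import pandas as pd

# A float column with one string in it, like column_2 after the replacement.
mixed = pd.Series([0.5, "string_value", 1.5], dtype=object)

# Values that cannot be parsed as numbers become NaN.
cleaned = pd.to_numeric(mixed, errors="coerce")

print(cleaned.isna().tolist())      # [False, True, False]
print(cleaned.mean())               # 1.0 (NaN is skipped by default)
print(cleaned.mean(skipna=False))   # nan
```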

Solution 2: Remove Non-Numeric Rows Before Calculating Mean

In [None]:
mean_column2 = df1[df1['column_2'].apply(lambda x: isinstance(x, (int, float)))]['column_2'].mean()
print("Mean of column_2 after filtering:", mean_column2)


Mean of column_2 after filtering: 0.40822028785282444


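🔹 This method keeps only the entries that are int or float instances before computing the mean. It is meant to be run while the column still contains the string; after Solution 1 the string has already become NaN, which mean() skips, so both solutions give the same result here.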

Q6. What do you understand about the windows function in pandas and list the types of windows functions?

A6. **What is a Window Function in Pandas?**

A window function in Pandas is used for performing calculations over a rolling or expanding window of data points. These functions are useful in time-series analysis, moving averages, smoothing, and trend detection.

They work by sliding over the data (row-wise), performing operations on a subset of rows within the specified window.

### Types of Window Functions in Pandas

1.) **Rolling Window Functions (rolling())**

a.) Performs operations on a fixed-size moving window.

b.) Example: Moving Average, Moving Sum, Moving Standard Deviation.

Example Code:

In [None]:
import pandas as pd
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# 3-point Moving Average
df['Moving_Avg'] = df['Values'].rolling(window=3).mean()
print(df)


   Values  Moving_Avg
0      10         NaN
1      20         NaN
2      30        20.0
3      40        30.0
4      50        40.0
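
The leading NaN values appear because a window of 3 needs three observations by default. As a sketch of one way to change that, the `min_periods` parameter of `rolling()` lets the window produce a result as soon as it has at least that many values (the variable names here are illustrative).

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

# Default: NaN until the window holds 3 observations.
strict = values.rolling(window=3).mean()

# min_periods=1: average whatever is available, up to 3 observations.
relaxed = values.rolling(window=3, min_periods=1).mean()

print(strict.isna().tolist())  # [True, True, False, False, False]
print(relaxed.tolist())        # [10.0, 15.0, 20.0, 30.0, 40.0]
```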


2.) **Expanding Window Functions (expanding())**

a.) Expands the window size dynamically from the start to the current row.

b.) Example: Cumulative Sum, Cumulative Mean.

Example Code:

In [None]:
df['Cumulative_Avg'] = df['Values'].expanding().mean()
print(df)


   Values  Moving_Avg  Cumulative_Avg
0      10         NaN            10.0
1      20         NaN            15.0
2      30        20.0            20.0
3      40        30.0            25.0
4      50        40.0            30.0


3.) **Exponentially Weighted Window Functions (ewm())**

a.) Assigns more weight to recent values, useful in time series smoothing.

b.) Example: Exponentially Weighted Moving Average (EWMA).

Example Code:

In [None]:
df['EWMA'] = df['Values'].ewm(span=3, adjust=False).mean()
print(df)


   Values  Moving_Avg  Cumulative_Avg    EWMA
0      10         NaN            10.0  10.000
1      20         NaN            15.0  15.000
2      30        20.0            20.0  22.500
3      40        30.0            25.0  31.250
4      50        40.0            30.0  40.625
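
The EWMA column above can be reproduced by hand. With `adjust=False`, pandas uses the recurrence y[0] = x[0] and y[t] = alpha * x[t] + (1 - alpha) * y[t-1], where alpha = 2 / (span + 1). The sketch below compares a manual loop against `ewm()`; the names `manual` and `from_pandas` are illustrative.

```python
import pandas as pd

values = [10, 20, 30, 40, 50]
span = 3
alpha = 2 / (span + 1)  # 0.5 for span=3

# Manual recurrence: start from the first value, then blend each new
# value with the previous smoothed value.
manual = [float(values[0])]
for x in values[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])

from_pandas = pd.Series(values).ewm(span=span, adjust=False).mean().tolist()

print(manual)  # [10.0, 15.0, 22.5, 31.25, 40.625]
```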


4.) **GroupBy Window Functions**

a.) Applies window functions within groups.

b.) Example: Rolling Mean within Categories.

Example Code:

In [None]:
df['Group'] = ['A', 'A', 'B', 'B', 'B']
df['Grouped_Rolling_Avg'] = df.groupby('Group')['Values'].rolling(2).mean().reset_index(0, drop=True)
print(df)


   Values  Moving_Avg  Cumulative_Avg    EWMA Group  Grouped_Rolling_Avg
0      10         NaN            10.0  10.000     A                  NaN
1      20         NaN            15.0  15.000     A                 15.0
2      30        20.0            20.0  22.500     B                  NaN
3      40        30.0            25.0  31.250     B                 35.0
4      50        40.0            30.0  40.625     B                 45.0


### Key Takeaways

✅ rolling() → Fixed-size moving window

✅ expanding() → Expanding window (cumulative operations)

✅ ewm() → Exponentially weighted window

✅ groupby().rolling() → Rolling window within groups

Q7. Write a code to print only the current month and year at the time of answering this question.

[Hint: Use pandas.datetime function]

In [None]:
# A7.

import pandas as pd

# Get current date
current_date = pd.to_datetime("today")

# Extract month and year
current_month = current_date.month
current_year = current_date.year

print(f"Current Month: {current_month}")
print(f"Current Year: {current_year}")


Current Month: 3
Current Year: 2025
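
If a single formatted string is preferred over two separate numbers, one option is `Timestamp.strftime`. This is only a sketch of an alternative presentation; the format string `"%B %Y"` (full month name, four-digit year) is a choice made here, not something the question requires.

```python
import pandas as pd

now = pd.Timestamp.now()

# "%B" is the full month name and "%Y" the four-digit year, e.g. "March 2025".
month_year = now.strftime("%B %Y")
print(month_year)
```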


Q8. Write a Python program that takes in two dates as input (in the format YYYY-MM-DD) and calculates the difference between them in days, hours, and minutes using Pandas time delta. The program should prompt the user to enter the dates and display the result.

In [None]:
# A8.

import pandas as pd

# Take user input for two dates
date1 = input("Enter the first date (YYYY-MM-DD): ")
date2 = input("Enter the second date (YYYY-MM-DD): ")

# Convert strings to datetime
date1 = pd.to_datetime(date1)
date2 = pd.to_datetime(date2)

# Calculate the difference
time_diff = abs(date2 - date1)

# Extract days, hours, and minutes
days = time_diff.days
hours = time_diff.seconds // 3600
minutes = (time_diff.seconds % 3600) // 60

# Display the result
print(f"Difference: {days} days, {hours} hours, {minutes} minutes")


Enter the first date (YYYY-MM-DD): 1999-05-08
Enter the second date (YYYY-MM-DD): 2025-03-23
Difference: 9451 days, 0 hours, 0 minutes
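
Because the inputs are plain dates, the hours and minutes are always 0 in the run above. The sketch below uses two made-up timestamps that include a time of day, and reads the parts from `Timedelta.components` instead of doing the integer division by hand.

```python
import pandas as pd

# Hypothetical inputs with a time component, to show non-zero hours/minutes.
start = pd.to_datetime("2025-03-01 08:15")
end = pd.to_datetime("2025-03-23 10:45")

# components splits the Timedelta into days, hours, minutes, seconds, ...
parts = abs(end - start).components

print(parts.days, parts.hours, parts.minutes)  # 22 2 30
```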


Q9. Write a Python program that reads a CSV file containing categorical data and converts a specified column to a categorical data type. The program should prompt the user to enter the file path, column name, and category order, and then display the sorted data.

In [None]:
# A9.

import pandas as pd

# User input for file path
file_path = input("Enter the CSV file path: ")

# Read the CSV file
df = pd.read_csv(file_path)

# Display available columns
print("\nAvailable columns in the dataset:", df.columns.tolist())

# User input for the column name
column_name = input("\nEnter the column name to convert to categorical: ")

# Check if the column exists
if column_name not in df.columns:
    print("Error: Column not found in the dataset.")
else:
    # Get unique values from the column
    unique_values = df[column_name].dropna().unique().tolist()
    print("\nUnique values in the column:", unique_values)

    # User input for category order (strip stray spaces around each name)
    category_order = [c.strip() for c in input("\nEnter the category order separated by commas: ").split(",")]

    # Convert the column to categorical type with specified order
    df[column_name] = pd.Categorical(df[column_name], categories=category_order, ordered=True)

    # Sort data by the categorical column
    sorted_df = df.sort_values(by=column_name)

    # Display the sorted data
    print("\nSorted Data:")
    print(sorted_df.head())

    # Optional: Save the sorted file
    sorted_df.to_csv("sorted_output.csv", index=False)
    print("\nSorted data saved as 'sorted_output.csv'.")



Available columns in the dataset: ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value']

Unique values in the column: [606.0, 277.0, 495.0, 11.0, 237.0, 204.0, 218.0, 441.0, 599.0, 603.0, 261.0, 138.0, 170.0, 659.0, 331.0, 50.0, 107.0, 595.0, 199.0, 1258.0, 255.0, 305.0, 561.0, 480.0, 940.0, 225.0, 979.0, 402.0, 1539.0, 901.0, 229.0, 500.0, 619.0, 2651.0, 169.0, 728.0, 419.0, 190.0, 985.0, 287.0, 429.0, 226.0, 294.0, 1672.0, 242.0, 329.0, 350.0, 1593.0, 1780.0, 122.0, 464.0, 366.0, 335.0, 510.0, 468.0, 282.0, 586.0, 139.0, 285.0, 307.0, 404.0, 729.0, 1123.0, 228.0, 702.0, 378.0, 387.0, 515.0, 384.0, 274.0, 234.0, 679.0, 834.0, 957.0, 147.0, 375.0, 351.0, 1377.0, 467.0, 558.0, 182.0, 581.0, 222.0, 321.0, 173.0, 275.0, 850.0, 479.0, 302.0, 519.0, 317.0, 416.0, 1231.0, 456.0, 496.0, 407.0, 569.0, 364.0, 417.0, 301.0, 273.0, 434.0, 443.0, 414.0, 81.0, 236.0, 425.0, 73.0, 1007.0, 536.0, 930.0, 759
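
The run above used a numeric column, which is not a natural fit for an ordered category. As a smaller sketch of the same technique, the block below uses an in-memory CSV with a made-up `size` column; the data, column names and category order are all assumptions for illustration. Any value missing from the supplied order would become NaN.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "item,size\nshirt,medium\nhat,small\ncoat,large\nscarf,small\n"
sizes = pd.read_csv(io.StringIO(csv_text))

# Simulated user input; strip spaces so " medium" matches "medium".
order = [s.strip() for s in "small, medium, large".split(",")]

sizes["size"] = pd.Categorical(sizes["size"], categories=order, ordered=True)

# Sorting follows the category order, not alphabetical order.
# kind="stable" keeps tied rows in their original order.
sorted_sizes = sizes.sort_values("size", kind="stable")

print(sorted_sizes["item"].tolist())  # ['hat', 'scarf', 'shirt', 'coat']
```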

Q10. Write a Python program that reads a CSV file containing sales data for different products and visualizes the data using a stacked bar chart to show the sales of each product category over time. The program should prompt the user to enter the file path and display the chart.

In [1]:
# A10.

import pandas as pd
import matplotlib.pyplot as plt

# User input for the file path
file_path = input("Enter the CSV file path: ")

# Read the CSV file
df = pd.read_csv(file_path)

# Display available columns
print("\nAvailable columns in the dataset:", df.columns.tolist())

# Checking if required columns exist
required_columns = ['Date', 'Product_Category', 'Sales']
if not all(col in df.columns for col in required_columns):
    print(f"Error: The dataset must contain the columns {required_columns}.")
else:
    # Convert 'Date' column to datetime
    df['Date'] = pd.to_datetime(df['Date'])

    # Pivot the DataFrame to get sales data for each category over time
    sales_pivot = df.pivot_table(index='Date', columns='Product_Category', values='Sales', aggfunc='sum')

    # Plot the stacked bar chart
    sales_pivot.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')

    # Formatting the plot
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.title('Sales of Each Product Category Over Time')
    plt.xticks(rotation=45)
    plt.legend(title="Product Category")

    # Display the chart
    plt.show()


Enter the CSV file path: /content/sample_data/mnist_test.csv

Available columns in the dataset: ['7', '0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9', '0.10', '0.11', '0.12', '0.13', '0.14', '0.15', '0.16', '0.17', '0.18', '0.19', '0.20', '0.21', '0.22', '0.23', '0.24', '0.25', '0.26', '0.27', '0.28', '0.29', '0.30', '0.31', '0.32', '0.33', '0.34', '0.35', '0.36', '0.37', '0.38', '0.39', '0.40', '0.41', '0.42', '0.43', '0.44', '0.45', '0.46', '0.47', '0.48', '0.49', '0.50', '0.51', '0.52', '0.53', '0.54', '0.55', '0.56', '0.57', '0.58', '0.59', '0.60', '0.61', '0.62', '0.63', '0.64', '0.65', '0.66', '0.67', '0.68', '0.69', '0.70', '0.71', '0.72', '0.73', '0.74', '0.75', '0.76', '0.77', '0.78', '0.79', '0.80', '0.81', '0.82', '0.83', '0.84', '0.85', '0.86', '0.87', '0.88', '0.89', '0.90', '0.91', '0.92', '0.93', '0.94', '0.95', '0.96', '0.97', '0.98', '0.99', '0.100', '0.101', '0.102', '0.103', '0.104', '0.105', '0.106', '0.107', '0.108', '0.109', '0.110', '0.111', '0
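
The file used in the run above does not contain the required columns, so no chart was produced. The sketch below shows the pivot step on a small in-memory CSV with the expected `Date`, `Product_Category` and `Sales` columns; the data is made up for illustration. The resulting table is the shape that `plot(kind='bar', stacked=True)` stacks, with one bar per date and one segment per category.

```python
import io

import pandas as pd

# Hypothetical sales data standing in for a file on disk.
csv_text = """Date,Product_Category,Sales
2025-01-01,Books,100
2025-01-01,Toys,50
2025-01-01,Books,25
2025-02-01,Books,80
2025-02-01,Toys,70
"""
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# One row per date, one column per category, duplicate entries summed.
pivot = sales.pivot_table(
    index="Date", columns="Product_Category", values="Sales", aggfunc="sum"
)

print(pivot["Books"].tolist())  # [125, 80]
print(pivot["Toys"].tolist())   # [50, 70]
```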

Q11. You are given a CSV file containing student data that includes the student ID and their test score. Write a Python program that reads the CSV file, calculates the mean, median, and mode of the test scores, and displays the results in a table.

The program should do the following:

a.)  Prompt the user to enter the file path of the CSV file containing the student data.

b.) Read the CSV file into a Pandas DataFrame.

c.) Calculate the mean, median, and mode of the test scores using Pandas tools.

d.) Display the mean, median, and mode in a table.

Assume the CSV file contains the following columns

a.) Student ID: The ID of the student.

b.) Test Score: The score of the student's test.

Example usage of the program:

```
Enter the file path of the CSV file containing the student data: student_data.csv

+-----------+--------+
| Statistic | Value  |
+-----------+--------+
| Mean      | 79.6   |
| Median    | 82     |
| Mode      | 85, 90 |
+-----------+--------+
```

Assume that the CSV file student_data.csv contains the following data:

```
Student ID,Test Score
1,85
2,90
3,80
4,75
5,85
6,82
7,78
8,85
9,90
10,85
```

The program should calculate the mean, median, and mode of the test scores and display the results in a table.

In [1]:
# A11.

import pandas as pd
from tabulate import tabulate

# Prompt user for file path
file_path = input("Enter the file path of the CSV file containing the student data: ")

# Read CSV into a Pandas DataFrame
df = pd.read_csv(file_path)

# Ensure required columns exist
if "Test Score" not in df.columns:
    print("Error: The dataset must contain a 'Test Score' column.")
else:
    # Calculate statistics
    mean_score = df["Test Score"].mean()
    median_score = df["Test Score"].median()
    mode_score = df["Test Score"].mode().tolist()

    # Convert mode to a comma-separated string if multiple values exist
    mode_str = ", ".join(map(str, mode_score))

    # Create a table to display results
    results = [
        ["Mean", round(mean_score, 2)],
        ["Median", median_score],
        ["Mode", mode_str]
    ]

    # Print results in table format
    print("\n" + tabulate(results, headers=["Statistic", "Value"], tablefmt="grid"))


Enter the file path of the CSV file containing the student data: /content/sample_data/testscore.csv

+-------------+---------+
| Statistic   |   Value |
+=============+=========+
| Mean        |    83.5 |
+-------------+---------+
| Median      |    85   |
+-------------+---------+
| Mode        |    85   |
+-------------+---------+
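
For a version that can be checked without a file on disk or the `tabulate` package, the sketch below loads the ten rows listed in the question from a string and builds the summary as a small DataFrame. The names `scores` and `summary` are illustrative. For that data the mean is 83.5, the median 85 and the mode 85, which agrees with the program output above rather than with the sample table in the question.

```python
import io

import pandas as pd

# The ten rows listed in the question, held in memory instead of a file.
csv_text = """Student ID,Test Score
1,85
2,90
3,80
4,75
5,85
6,82
7,78
8,85
9,90
10,85
"""
scores = pd.read_csv(io.StringIO(csv_text))["Test Score"]

# mode() returns a Series because there can be several modes.
mode_str = ", ".join(map(str, scores.mode()))

summary = pd.DataFrame({
    "Statistic": ["Mean", "Median", "Mode"],
    "Value": [scores.mean(), scores.median(), mode_str],
})

print(summary.to_string(index=False))
```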
