In [1]:
import pandas as pd

## Q1

In [2]:
course_name = ["Data Science", "Machine Learning", "Big Data", "Data Engineer"]
duration = [2,3,6,4]
df = pd.DataFrame(data = {"course_name" : course_name, "duration" : duration})

In [3]:
df

        course_name  duration
0      Data Science         2
1  Machine Learning         3
2          Big Data         6
3     Data Engineer         4


In [4]:
df[1:2]

        course_name  duration
1  Machine Learning         3


## Q2

In Pandas, both loc and iloc are indexers used to select data from a DataFrame. The key difference is what they select by: loc uses labels, while iloc uses integer positions.

loc is used for label-based indexing, which means that it selects data based on the row and column labels. It takes two arguments, separated by a comma, to specify the row and column labels. For example, to select a single value from a DataFrame, you can use the following code:
    
    

In [5]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df.loc['a', 'A']  # returns 1


1

In this example, df.loc['a', 'A'] selects the value in the row labeled 'a' and the column labeled 'A', which is 1.

On the other hand, iloc is used for position-based indexing, which means that it selects data based on the row and column positions. It also takes two arguments, separated by a comma, to specify the row and column positions. For example, to select a single value from a DataFrame, you can use the following code:

In [6]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df.iloc[0, 0]  # returns 1


1

In this example, df.iloc[0, 0] selects the value in the first row and first column of the DataFrame, which is 1.

In summary, loc is used for label-based indexing and iloc is used for position-based indexing.
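
One practical consequence of this difference shows up when slicing. The following is a minimal sketch, reusing the same small DataFrame under the assumed name df_demo: a loc slice includes its end label, while an iloc slice excludes its end position, like an ordinary Python slice.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

# Label-based slice: both endpoints, 'a' and 'b', are included
by_label = df_demo.loc['a':'b', 'A']

# Position-based slice: the end position 1 is excluded
by_position = df_demo.iloc[0:1, 0]

print(by_label.tolist())     # [1, 2]
print(by_position.tolist())  # [1]
```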

## Q3

In [10]:
course_name = ["Data Science", "Machine Learning", "Big Data", "Data Engineer"]
duration = [2,3,6,4]
df = pd.DataFrame(data = {"course_name" : course_name, "duration" : duration})

reindex = [3, 0, 1, 2]
new_df = df.reindex(reindex)

print(new_df.loc[2])
print(new_df.iloc[2])

course_name    Big Data
duration              6
Name: 2, dtype: object
course_name    Machine Learning
duration                      3
Name: 1, dtype: object


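As you can see, new_df.loc[2] returns the row with index label 2 (which is "Big Data"), while new_df.iloc[2] returns the row at index position 2 (which is "Machine Learning"). After reindexing with [3, 0, 1, 2], the labels no longer match the positions, so the two indexers return different rows.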

## Q4

In [12]:
import numpy as np

# Creating a DataFrame
columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']
indices = [1, 2, 3, 4, 5, 6]
df1 = pd.DataFrame(np.random.rand(6, 6), columns=columns, index=indices)

In [14]:
df1.mean()

column_1    0.473732
column_2    0.581413
column_3    0.519963
column_4    0.482218
column_5    0.247956
column_6    0.508891
dtype: float64

In [15]:
df1['column_2'].std()

0.28934114841651815

## Q5

In [16]:
df1.loc[2, 'column_2'] = "string data"

# Finding the mean of column_2
try:
    mean_of_column_2 = df1['column_2'].mean()
    print("Mean of 'column_2':", mean_of_column_2)
except TypeError as e:
    print(f"TypeError: {e}")

TypeError: unsupported operand type(s) for +: 'float' and 'str'


We first used loc to replace the value at row label 2 of column_2 with a string. Then we tried to find the mean of column_2 using the mean() method, but we got a TypeError because the column now mixes floats with a string, and the two cannot be added together.

To fix this error, we need to make sure that all the values in column_2 are numeric before finding the mean. We can do this by replacing the string data with a numeric value or removing the row altogether.
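
One possible way to do this, sketched below on a small hypothetical Series rather than on df1 itself, is pd.to_numeric with errors="coerce". It converts values that cannot be parsed as numbers to NaN, which mean() then skips by default.

```python
import pandas as pd

# Hypothetical column that mixes numbers with one stray string
mixed = pd.Series([0.5, "string data", 1.5], name="column_2")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(mixed, errors="coerce")

# mean() ignores NaN by default, so only 0.5 and 1.5 contribute
print(numeric.mean())  # 1.0
```

Whether dropping the bad value silently is acceptable depends on the data; replacing it with a known numeric value is the other option mentioned above.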

## Q6

In Pandas, a window function is a way of performing calculations on a subset of a DataFrame or a Series. It allows us to apply functions to a sliding window of data, where the size and shape of the window can be customized.

There are several types of window functions available in Pandas, including:

Rolling Functions: These functions work on a fixed-size sliding window and can be used to calculate rolling statistics, such as rolling mean or rolling standard deviation.

Expanding Functions: These functions work on an expanding window, where the window size starts at a minimum and grows until it encompasses all data points. This can be used to calculate cumulative statistics, such as cumulative sum or cumulative mean.

Exponentially Weighted Functions: These functions apply exponentially decreasing weights to each data point based on their age, giving more importance to recent data. They can be used to calculate exponentially weighted moving statistics, such as exponentially weighted moving average or exponentially weighted moving standard deviation.

Aggregation Functions: Any of the windows above can be aggregated with methods such as count(), max(), or min(), or with several functions at once through agg().

Transformation Functions: Related methods such as shift(), diff(), and pct_change() compare each value with earlier ones and return a new Series or DataFrame with the same shape as the original data. They are not window objects themselves, but they are often used alongside them.

Overall, window functions provide a powerful tool for analyzing time-series or sequential data, allowing us to calculate complex statistics and gain insights into the underlying patterns and trends.
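
As a rough illustration of the first three window types, the sketch below applies rolling, expanding, and exponentially weighted windows to a small made-up Series. The window size of 3 and alpha of 0.5 are arbitrary choices for the example.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling: fixed window of 3 values; the first two results are NaN
rolling_mean = s.rolling(window=3).mean()

# Expanding: window grows from the first value to the current one (cumulative sum)
expanding_sum = s.expanding().sum()

# Exponentially weighted: recent values get more weight
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

print(rolling_mean.tolist())   # [nan, nan, 2.0, 3.0, 4.0]
print(expanding_sum.tolist())  # [1.0, 3.0, 6.0, 10.0, 15.0]
print(ewm_mean.tolist())       # [1.0, 1.5, 2.25, 3.125, 4.0625]
```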

## Q7

In [24]:
# pd.datetime is deprecated and was removed in pandas 2.0; pd.Timestamp.now() is the replacement
pd.Timestamp.now().strftime("%B %Y")


'March 2023'

## Q8

In [25]:
# Get the two dates as input from the user
date1 = input("Enter the first date (YYYY-MM-DD): ")
date2 = input("Enter the second date (YYYY-MM-DD): ")

# Convert the input strings to Pandas datetime objects
date1 = pd.to_datetime(date1)
date2 = pd.to_datetime(date2)

# Calculate the difference between the two dates using timedelta
diff = date2 - date1

# Extract the number of days, hours, and minutes from the difference
days = diff.days
hours = diff.seconds // 3600
minutes = (diff.seconds // 60) % 60

# Display the result to the user
print("The difference between", date1.strftime("%Y-%m-%d"), "and", date2.strftime("%Y-%m-%d"), "is:")
print(days, "days,", hours, "hours, and", minutes, "minutes.")


Enter the first date (YYYY-MM-DD):  2023-05-23
Enter the second date (YYYY-MM-DD):  2024-05-23


The difference between 2023-05-23 and 2024-05-23 is:
366 days, 0 hours, and 0 minutes.
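
The same breakdown can also be read from the Timedelta's components attribute instead of doing the integer division by hand. This is a small sketch with fixed, made-up timestamps (no user input) so that the hours and minutes are non-zero.

```python
import pandas as pd

# Hypothetical timestamps chosen so the difference has hours and minutes
start = pd.to_datetime("2023-05-23 08:15")
end = pd.to_datetime("2023-05-25 10:45")

# components splits the Timedelta into days, hours, minutes, seconds, ...
parts = (end - start).components
print(parts.days, "days,", parts.hours, "hours, and", parts.minutes, "minutes.")
# 2 days, 2 hours, and 30 minutes.
```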


## Q9

In [None]:
file_path = input("Enter the file path: ")
col_name = input("Enter the column name: ")

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path)

# Prompt the user to enter the category order
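# Strip surrounding whitespace so "low, high" and "low,high" give the same categories
cat_order = [c.strip() for c in input("Enter the category order (comma-separated list): ").split(",")]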

# Convert the specified column to a categorical data type with the specified category order
cat_type = pd.api.types.CategoricalDtype(categories=cat_order, ordered=True)
df[col_name] = df[col_name].astype(cat_type)

# Sort the data by the specified column
df = df.sort_values(by=[col_name])

# Display the sorted data
print(df)
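
Since the cell above needs a file and user input, here is a self-contained sketch of the same idea on made-up data. The column names size and price and the category order are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data standing in for the CSV file
df_sizes = pd.DataFrame({"size": ["large", "small", "medium", "small"],
                         "price": [30, 10, 20, 12]})

# An ordered categorical sorts by the given category order, not alphabetically
size_type = pd.api.types.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
df_sizes["size"] = df_sizes["size"].astype(size_type)

# mergesort is stable, so rows with equal categories keep their original order
sorted_sizes = df_sizes.sort_values(by="size", kind="mergesort")
print(sorted_sizes)
```

Note that any value not listed in the categories becomes NaN after the astype call, so the category list should cover every value in the column.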


## Q10

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Prompt the user to enter the file path
file_path = input("Enter the file path: ")

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path)

# Set the 'date' column as the index
df.set_index('date', inplace=True)

# Reshape the data to have one row per date and product category
df = df.stack().reset_index().rename(columns={'level_1': 'category', 0: 'sales'})

# Create a pivot table to aggregate the sales by date and product category
pivot_table = pd.pivot_table(df, values='sales', index='date', columns='category', aggfunc='sum')

# Create a stacked bar chart of the sales by product category over time
pivot_table.plot(kind='bar', stacked=True)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Sales by Product Category over Time')
plt.show()
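
To make the reshaping step easier to follow, the sketch below runs the stack and pivot steps on a tiny made-up table and leaves the plotting out. The category names books and toys are assumptions, not part of the original data.

```python
import pandas as pd

# Hypothetical wide table: one row per date, one column per product category
wide = pd.DataFrame({"date": ["2023-01-01", "2023-01-02"],
                     "books": [10, 12],
                     "toys": [5, 7]}).set_index("date")

# stack() moves the category columns into the index, giving one row per (date, category)
long = wide.stack().reset_index()
long.columns = ["date", "category", "sales"]

# The pivot table aggregates back to one row per date and one column per category
totals = long.pivot_table(values="sales", index="date", columns="category", aggfunc="sum")
print(long)
print(totals)
```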

## Q11

In [None]:
# Prompt the user to enter the file path of the CSV file containing the student data
file_path = input("Enter the file path: ")

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path)

# Calculate the mean, median, and mode of the test scores using Pandas tools
mean_score = df['Test Score'].mean()
median_score = df['Test Score'].median()
# mode() returns a Series because ties are possible; take the first (smallest) mode
mode_score = df['Test Score'].mode().iloc[0]

# Display the mean, median, and mode in a table
data = {'Statistic': ['Mean', 'Median', 'Mode'], 'Test Score': [mean_score, median_score, mode_score]}
df_stats = pd.DataFrame(data)
print(df_stats)
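
One detail worth keeping in mind is that mode() can return more than one value. The sketch below uses a small made-up set of scores in which two values are tied for most frequent.

```python
import pandas as pd

# Hypothetical scores: 85 and 90 both appear twice
scores = pd.Series([70, 85, 85, 90, 90], name="Test Score")

modes = scores.mode()   # all tied modes, sorted ascending
print(modes.tolist())   # [85, 90]
print(scores.mean())    # 84.0
print(scores.median())  # 85.0
```

Taking only the first element, as the cell above does, reports the smallest of the tied modes.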