# Assignment

In [2]:
# Consider the following code to answer the questions below:
import pandas as pd
course_name = ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer']
duration = [2,3,6,4]
df = pd.DataFrame(data = {'course_name' : course_name, 'duration' : duration})

## Q1. Write a code to print the data present in the second row of the dataframe, df.

In [3]:
second_row = df.iloc[1]
print(second_row)

course_name    Machine Learning
duration                      3
Name: 1, dtype: object


## Q2. What is the difference between the functions loc and iloc in pandas.DataFrame?

Ans: In pandas, `.loc` and `.iloc` are indexers (used with square brackets rather than called like functions) for accessing and assigning data in a DataFrame. They differ in how they interpret the keys you pass them.

1. `.loc`:
The `.loc` indexer is primarily label-based, meaning it uses the labels of rows or columns to access data. It can accept multiple forms of indexing, including:
   - Single label: `df.loc[row_label, column_label]`
   - List of labels: `df.loc[[row_label1, row_label2], [column_label1, column_label2]]`
   - Slicing: `df.loc[row_label_start:row_label_end, column_label_start:column_label_end]` (both endpoints are included)
   - Boolean mask (conditional indexing): `df.loc[df['column_label'] > value]`

2. `.iloc`:
The `.iloc` indexer is position-based, meaning it uses zero-based integer positions to access data, regardless of the index labels. It accepts integers, lists of integers, or integer slices. Examples include:
   - Single integer: `df.iloc[row_index, column_index]`
   - List of integers: `df.iloc[[row_index1, row_index2], [column_index1, column_index2]]`
   - Slicing: `df.iloc[row_index_start:row_index_end, column_index_start:column_index_end]` (the end position is excluded, as in ordinary Python slicing)

In summary, the main difference between `.loc` and `.iloc` is the type of indexing used. `.loc` uses label-based indexing, while `.iloc` uses integer-based indexing. The choice between these two methods depends on whether you want to access data based on labels or integer positions.
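
The difference is easiest to see when the index labels do not match the row positions. The following is a minimal sketch; the frame `demo` and its labels 10 to 40 are hypothetical values chosen for illustration, not part of the assignment.

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, 30, 40) differ from positions (0..3)
demo = pd.DataFrame(
    {
        "course_name": ["Data Science", "Machine Learning", "Big Data", "Data Engineer"],
        "duration": [2, 3, 6, 4],
    },
    index=[10, 20, 30, 40],
)

by_label = demo.loc[20, "course_name"]   # row labelled 20
by_position = demo.iloc[1, 0]            # second row, first column
label_slice = demo.loc[10:30]            # label slice: both endpoints included
position_slice = demo.iloc[0:2]          # position slice: end excluded

print(by_label, "|", by_position)
print(len(label_slice), "rows vs", len(position_slice), "rows")
```

Both scalar lookups reach the same cell, but the label slice returns three rows while the position slice returns two.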

## Q3. Reindex the given dataframe using a variable, reindex = [3,0,1,2] and store it in the variable, new_df then find the output for both new_df.loc[2] and new_df.iloc[2].

In [4]:
reindex = [3, 0, 1, 2]
new_df = df.reindex(reindex)

print(new_df.loc[2])
print(new_df.iloc[2])

course_name    Big Data
duration              6
Name: 2, dtype: object
course_name    Machine Learning
duration                      3
Name: 1, dtype: object


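Reindexing with [3, 0, 1, 2] rearranges the rows into that label order while each row keeps its original label. `new_df.loc[2]` therefore returns the row labelled 2 (Big Data), which now sits in the last position, whereas `new_df.iloc[2]` returns the row in the third position, which carries label 1 (Machine Learning).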

In [1]:
import pandas as pd
import numpy as np
columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']
indices = [1,2,3,4,5,6]
#Creating a dataframe:
df1 = pd.DataFrame(np.random.rand(6,6), columns = columns, index = indices)

## Q4. Write a code to find the following statistical measurements for the above dataframe df1: 
## (i) mean of each and every column present in the dataframe.
## (ii) standard deviation of column, ‘column_2’

In [2]:
# Mean of each column
column_means = df1.mean()
print("Mean of each column:")
print(column_means)

# Standard deviation of column_2
column_2_std = df1['column_2'].std()
print("\nStandard deviation of column_2:")
print(column_2_std)

Mean of each column:
column_1    0.562772
column_2    0.559832
column_3    0.601998
column_4    0.368296
column_5    0.502065
column_6    0.622037
dtype: float64

Standard deviation of column_2:
0.28767682550923784


## Q5. Replace the data present in the second row of column, ‘column_2’ by a string variable then find the mean of column, column_2.
If you are getting errors in executing it then explain why.
[Hint: To replace the data use df1.loc[] and equate this to string data of your choice.]

In [3]:
# Replace data in the second row of 'column_2'
df1.loc[2, 'column_2'] = 'string_data'

# Find the mean of 'column_2'
column_2_mean = df1['column_2'].mean()
print("Mean of column_2 after replacement:")
print(column_2_mean)


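TypeError: unsupported operand type(s) for +: 'float' and 'str'

Ans: Assigning a string to one cell turns `column_2` into an `object` column holding a mix of floats and a string. `mean()` has to add the values together, and Python cannot add a float to a string, so the call fails with a TypeError. The exact wording of the message depends on the pandas version, and recent versions may also warn about or reject the assignment itself because the value is incompatible with the column's float64 dtype.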
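
If the mean is still needed after such a replacement, one possible workaround (an assumption about what the reader wants, not part of the question) is to coerce the column back to numbers with `pd.to_numeric`, which turns unparseable entries into NaN so that `mean()` skips them. The series `mixed` below is a small hypothetical stand-in for `column_2`.

```python
import pandas as pd

# Hypothetical stand-in for column_2 after the string was written into it
mixed = pd.Series([0.2, "string_data", 0.6], dtype=object)

# Non-numeric entries become NaN instead of raising
numeric = pd.to_numeric(mixed, errors="coerce")

# mean() ignores NaN by default, so only 0.2 and 0.6 are averaged
mean_value = numeric.mean()
print(mean_value)
```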

## Q6. What do you understand about the windows function in pandas and list the types of windows functions?

Ans: In pandas, window functions are used for performing calculations over a specified window or group of data points in a DataFrame. These functions allow you to calculate statistics or apply transformations to a specific subset of data based on a defined window or group.

Pandas supports four types of windowing operations:

1. Rolling window:
   - `rolling()`: Computes an aggregation such as mean, sum, min, or max over a sliding window of fixed size (or of fixed time span for datetime-like indexes).

2. Weighted window:
   - `rolling(win_type=...)`: A rolling window whose observations are weighted, for example with triangular or Gaussian weights. The weights come from `scipy.signal`, so SciPy must be installed.

3. Expanding window:
   - `expanding()`: Computes an aggregation over all data from the start up to the current point.

4. Exponentially weighted window:
   - `ewm()`: Computes statistics such as the exponentially weighted moving average. Instead of a fixed window size, the decay of the weights is set with `span`, `com`, `halflife`, or `alpha`.

Methods such as `shift()`, `diff()`, and `rank()` are often used alongside these to produce lagged values, differences, and rankings, but they are regular Series/DataFrame methods rather than windowing operations.

These window operations cover tasks such as rolling averages, cumulative sums, and exponentially weighted averages. They are particularly useful for time series analysis, data smoothing, and feature engineering.
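
A minimal sketch of the rolling, expanding, and exponentially weighted windows on a small series follows. The series `s` and the parameter choices (`window=3`, `span=3`) are illustrative assumptions.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling: mean over the current value and the two before it
# (the first two results are NaN because the window is not yet full)
rolling_mean = s.rolling(window=3).mean()

# Expanding: running total from the first value up to the current one
expanding_sum = s.expanding().sum()

# Exponentially weighted: span=3 gives alpha = 2 / (span + 1) = 0.5;
# adjust=False uses the recursive form y_t = (1 - alpha) * y_{t-1} + alpha * x_t
ewm_mean = s.ewm(span=3, adjust=False).mean()

print(pd.DataFrame({"rolling": rolling_mean, "expanding": expanding_sum, "ewm": ewm_mean}))
```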

## Q7. Write a code to print only the current month and year at the time of answering this question.
[Hint: Use pandas.datetime function]

In [4]:
import pandas as pd

# Get the current date and time
current_datetime = pd.Timestamp.now()

# Extract the month and year from the current date
current_month = current_datetime.month
current_year = current_datetime.year

# Print the current month and year
print("Current month:", current_month)
print("Current year:", current_year)

Current month: 6
Current year: 2023
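
If a single readable label is preferred over two separate numbers, the month name can be combined with the year. This is a small sketch; it uses a fixed date so the result is reproducible, whereas a real run would start from `pd.Timestamp.now()`.

```python
import pandas as pd

# Fixed timestamp for a reproducible example; replace with pd.Timestamp.now() in practice
stamp = pd.Timestamp("2023-06-15")

# month_name() returns the English month name by default
label = f"{stamp.month_name()} {stamp.year}"
print(label)
```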


## Q8. Write a Python program that takes in two dates as input (in the format YYYY-MM-DD) and calculates the difference between them in days, hours, and minutes using Pandas time delta. The program should prompt the user to enter the dates and display the result.

In [5]:
import pandas as pd

# Prompt the user to enter the dates
date1 = input("Enter the first date (YYYY-MM-DD): ")
date2 = input("Enter the second date (YYYY-MM-DD): ")

# Convert the input strings to datetime objects
datetime1 = pd.to_datetime(date1)
datetime2 = pd.to_datetime(date2)

# Calculate the time difference
time_difference = datetime2 - datetime1

# Extract the days, hours, and minutes from the time difference
days = time_difference.days
hours = time_difference.seconds // 3600
minutes = (time_difference.seconds % 3600) // 60

# Display the result
print("Time difference:")
print("Days:", days)
print("Hours:", hours)
print("Minutes:", minutes)

Enter the first date (YYYY-MM-DD):  2023-08-15
Enter the second date (YYYY-MM-DD):  2023-08-17


Time difference:
Days: 2
Hours: 0
Minutes: 0
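
The program above reports the difference split into components (days, then the leftover hours and minutes). If the total difference expressed in a single unit is wanted instead, a `Timedelta` can be converted directly. The sketch below uses hard-coded example timestamps, which are assumptions, in place of user input.

```python
import pandas as pd

# Example timestamps (hypothetical values standing in for user input)
delta = pd.to_datetime("2023-08-17 06:30") - pd.to_datetime("2023-08-15")

# Component view: the same split the program above computes by hand
parts = delta.components
print(parts.days, "days,", parts.hours, "hours,", parts.minutes, "minutes")

# Total view: the whole difference expressed in one unit
total_hours = delta.total_seconds() / 3600
total_minutes = delta / pd.Timedelta(minutes=1)
print(total_hours, "hours in total;", total_minutes, "minutes in total")
```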


## Q9. Write a Python program that reads a CSV file containing categorical data and converts a specified column to a categorical data type. The program should prompt the user to enter the file path, column name, and category order, and then display the sorted data.

In [None]:
import pandas as pd

# Prompt the user to enter the file path
file_path = input("Enter the file path of the CSV file: ")

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Prompt the user to enter the column name and category order
column_name = input("Enter the column name: ")
category_order = input("Enter the category order (comma-separated): ")

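# Convert the specified column to an ordered categorical data type
# (strip whitespace so "a, b, c" and "a,b,c" produce the same categories)
categories = [c.strip() for c in category_order.split(",")]
df[column_name] = pd.Categorical(df[column_name], categories=categories, ordered=True)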

# Sort the DataFrame by the specified column
df_sorted = df.sort_values(column_name)

# Display the sorted data
print("Sorted Data:")
print(df_sorted)
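
To see the effect without a CSV file, the same steps can be tried on a small in-memory frame. The `sizes` data and the category order below are made up for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the CSV contents
sizes = pd.DataFrame(
    {"item": ["a", "b", "c", "d"], "size": ["large", "small", "medium", "small"]}
)

# Simulated user input for the category order
order = [c.strip() for c in "small, medium, large".split(",")]

sizes["size"] = pd.Categorical(sizes["size"], categories=order, ordered=True)

# Rows are sorted by the declared category order, not alphabetically
sorted_sizes = sizes.sort_values("size")
print(sorted_sizes)
```

Alphabetical sorting would have put "large" first; the categorical order puts it last. Any value not listed in the category order becomes NaN in the converted column.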


## Q10. Write a Python program that reads a CSV file containing sales data for different products and visualizes the data using a stacked bar chart to show the sales of each product category over time. The program should prompt the user to enter the file path and display the chart.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Prompt the user to enter the file path
file_path = input("Enter the file path of the CSV file: ")

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Set the date column as the index
# (assumes the CSV has a 'Date' column plus one numeric column per product category)
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Plot the stacked bar chart
df.plot(kind='bar', stacked=True)

# Set the chart title and axis labels
plt.title("Sales by Product Category Over Time")
plt.xlabel("Date")
plt.ylabel("Sales")

# Display the chart
plt.show()
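
The plot above works when the CSV is already in wide form, with one column per product category. If the file is in long form instead, with one row per date and category, it can be pivoted first. This is a sketch under that assumption; the column names `Category` and `Sales` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format sales data: one row per (date, category)
long_sales = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-02-01", "2023-02-01"]),
        "Category": ["Toys", "Books", "Toys", "Books"],
        "Sales": [100, 150, 120, 90],
    }
)

# One row per date, one column per category, summing any duplicate entries
wide_sales = long_sales.pivot_table(
    index="Date", columns="Category", values="Sales", aggfunc="sum"
)
print(wide_sales)

# wide_sales.plot(kind="bar", stacked=True) would then draw one stacked bar per date
```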


## Q11. You are given a CSV file containing student data that includes the student ID and their test score. Write a Python program that reads the CSV file, calculates the mean, median, and mode of the test scores, and displays the results in a table.

In [None]:
import pandas as pd

# Prompt the user to enter the file path
file_path = input("Enter the file path of the CSV file: ")

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Calculate the mean, median, and mode of the test scores
mean_score = df['Test Score'].mean()
median_score = df['Test Score'].median()
mode_scores = df['Test Score'].mode().to_list()

# Create a table to display the results
result_table = pd.DataFrame({
    'Statistic': ['Mean', 'Median', 'Mode'],
    'Value': [mean_score, median_score, mode_scores],
})

# Display the table
print(result_table)
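
`mode()` returns every value tied for the highest frequency, so the table may need to show more than one mode. The sketch below runs the same calculation on an in-memory CSV; the sample rows and the `Student ID` column name are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV contents; scores 80 and 90 each appear twice
csv_text = "Student ID,Test Score\n1,80\n2,90\n3,80\n4,90\n5,70\n"
scores = pd.read_csv(io.StringIO(csv_text))["Test Score"]

summary = pd.DataFrame(
    {
        "Statistic": ["Mean", "Median", "Mode"],
        # Join tied modes into one string so the table cell stays readable
        "Value": [scores.mean(), scores.median(), ", ".join(map(str, scores.mode()))],
    }
)
print(summary)
```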
