## Consider the following code to answer further questions:

In [2]:
import pandas as pd
course_name = ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer']
duration = [2,3,6,4]
df = pd.DataFrame(data = {'course_name' : course_name, 'duration' : duration})

## Q1. Write a code to print the data present in the second row of the dataframe, df.

In [3]:
df.iloc[1]

course_name    Machine Learning
duration                      3
Name: 1, dtype: object

## Q2. What is the difference between the functions loc and iloc in pandas.DataFrame?

In pandas, `loc` and `iloc` are indexers used to access and modify data in a DataFrame. They are used with square brackets rather than called like functions, and they differ in how they address rows and columns:

1. `loc` (label-based indexing):
   - `loc` allows you to access rows and columns in a DataFrame using **labels** or **boolean arrays**.
   - When using `loc`, you specify the **row labels** and **column labels** explicitly.
   - The syntax for using `loc` is `df.loc[row_label, column_label]`.
   - You can pass a single label, a list of labels, or a label slice for rows and columns. Label slices include the end label.

2. `iloc` (integer-based indexing):
   - `iloc` allows you to access rows and columns in a DataFrame using **integer-based positions**.
   - When using `iloc`, you specify the **row positions** and **column positions** explicitly.
   - The syntax for using `iloc` is `df.iloc[row_position, column_position]`.
   - You can pass a single position, a list of positions, or a position slice for rows and columns. Position slices exclude the end position, as in ordinary Python slicing.
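
A minimal sketch of the difference, using a hypothetical frame `demo` with string row labels so that labels and positions cannot be confused (the frame and its values are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions are distinct.
demo = pd.DataFrame(
    {"course_name": ["Data Science", "Machine Learning", "Big Data"],
     "duration": [2, 3, 6]},
    index=["a", "b", "c"],
)

by_label = demo.loc["b", "duration"]   # row labelled "b", column "duration"
by_position = demo.iloc[1, 1]          # second row, second column

label_slice = demo.loc["a":"b"]        # label slices include the end label
position_slice = demo.iloc[0:1]        # position slices exclude the end position

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Both lookups reach the same cell here, but the two slices differ in length because only the label slice includes its end point.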



## Q3. Reindex the given dataframe using a variable, reindex = [3,0,1,2] and store it in the variable, new_df then find the output for both new_df.loc[2] and new_df.iloc[2].

## Did you observe any difference in both the outputs? If so then explain it.

Yes, the two outputs differ. After reindexing, the row labels appear in the order 3, 0, 1, 2, so labels and positions no longer coincide.

`new_df.loc[2]` retrieves the row whose label is 2. It returns a pandas Series with `course_name` 'Big Data' and `duration` 6.

`new_df.iloc[2]` retrieves the row at position 2 (the third row), which carries the label 1. It returns a pandas Series with `course_name` 'Machine Learning' and `duration` 3.

In [8]:
reindex=[3,0,1,2]
new_df=df.reindex(reindex)
new_df.loc[2]

course_name    Big Data
duration              6
Name: 2, dtype: object

In [7]:
new_df.iloc[2]

course_name    Machine Learning
duration                      3
Name: 1, dtype: object

## Consider the below code to answer further questions:

In [10]:
import pandas as pd
import numpy as np
columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']
indices = [1,2,3,4,5,6]
#Creating a dataframe:
df1 = pd.DataFrame(np.random.rand(6,6), columns = columns, index = indices)
df1

| | column_1 | column_2 | column_3 | column_4 | column_5 | column_6 |
|---|---|---|---|---|---|---|
| 1 | 0.312536 | 0.349255 | 0.665746 | 0.215175 | 0.219141 | 0.43111 |
| 2 | 0.371018 | 0.163757 | 0.463015 | 0.287852 | 0.062186 | 0.945928 |
| 3 | 0.948483 | 0.466972 | 0.556908 | 0.115868 | 0.930562 | 0.660665 |
| 4 | 0.852287 | 0.085744 | 0.302462 | 0.874817 | 0.783869 | 0.876854 |
| 5 | 0.212119 | 0.866695 | 0.397363 | 0.310642 | 0.667827 | 0.866323 |
| 6 | 0.978972 | 0.127798 | 0.732816 | 0.062434 | 0.392588 | 0.984597 |


## Q4. Write a code to find the following statistical measurements for the above dataframe df1:
## (i) mean of each and every column present in the dataframe.
## (ii) standard deviation of column, ‘column_2’

In [11]:
df1.mean()

column_1    0.612569
column_2    0.343370
column_3    0.519718
column_4    0.311131
column_5    0.509362
column_6    0.794246
dtype: float64

In [13]:
df1['column_2'].std()

0.2947148167890711

## Q5. Replace the data present in the second row of column, ‘column_2’ by a string variable then find the mean of column, column_2.
## If you are getting errors in executing it then explain why.
## [Hint: To replace the data use df1.loc[] and equate this to string data of your choice.]


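After the replacement, calling `df1['column_2'].mean()` raises a `TypeError` such as `unsupported operand type(s) for +: 'float' and 'str'` (the exact message depends on the pandas version). Assigning a string turns the column into `object` dtype holding a mix of floats and one string. Computing the mean requires summing the values, and Python cannot add a float and a string with the `+` operator.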
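
A minimal sketch that reproduces the error, assuming a small hypothetical frame `demo` with fixed values in place of the random `df1`. The explicit cast to `object` is an assumption added so that the string assignment is accepted regardless of pandas version:

```python
import pandas as pd

# Hypothetical stand-in for df1: fixed values instead of np.random.rand.
demo = pd.DataFrame({"column_2": [0.35, 0.16, 0.47]}, index=[1, 2, 3])

# Cast to object first so the string assignment is accepted by the column.
demo["column_2"] = demo["column_2"].astype(object)
demo.loc[2, "column_2"] = "abc"

raised_type_error = False
try:
    demo["column_2"].mean()
except TypeError as exc:
    raised_type_error = True
    print(f"TypeError: {exc}")

# Coercing non-numeric entries to NaN lets the mean be computed over the rest.
numeric_mean = pd.to_numeric(demo["column_2"], errors="coerce").mean()
print(numeric_mean)
```

Coercing with `pd.to_numeric(..., errors="coerce")` is one way to recover a mean from such a column; it skips the string rather than repairing it.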

## Q6. What do you understand about the windows function in pandas and list the types of windows functions?

In pandas, window functions allow you to perform calculations on a specific window or subset of data within a DataFrame or Series. These functions operate on a rolling or expanding window of data and can be used to compute various statistical aggregations, such as moving averages, cumulative sums, and more.

The types of window functions available in pandas are as follows:

1. Rolling Window Functions:
   - Rolling functions calculate values based on a fixed-size sliding window.
   - They can be applied to both DataFrame and Series objects.
   - Examples of rolling window functions include `rolling(window).mean()`, `rolling(window).sum()`, `rolling(window).std()`, `rolling(window).max()`, `rolling(window).min()`, etc., where `window` is the number of rows in the window.

2. Expanding Window Functions:
   - Expanding functions calculate values based on all data points up to the current point.
   - They can be applied to both DataFrame and Series objects.
   - Examples of expanding window functions include `expanding().mean()`, `expanding().sum()`, `expanding().std()`, `expanding().max()`, `expanding().min()`, etc.

3. Time-Based Rolling Window Functions:
   - Time-based rolling windows work like rolling windows, but the window size is a time span rather than a fixed number of rows.
   - They require a datetime-like index in the DataFrame or Series (or a datetime column passed through the `on` parameter).
   - The window is given as an offset string, for example `rolling('2D').mean()`, `rolling('2D').sum()`, `rolling('2D').std()`, where `'2D'` means "the last two days". Current pandas versions have no `freq` parameter on `rolling()`.

4. Expanding Windows on a Time-Based Index:
   - `expanding()` takes neither a window size nor a frequency. Its main parameter is `min_periods`, the minimum number of observations required before a result is produced.
   - It behaves the same on a time-based index as on any other index: each result uses all rows from the first one up to the current one.
   - Examples include `expanding(min_periods=1).mean()`, `expanding(min_periods=1).sum()`, `expanding(min_periods=1).std()`, etc.

Besides these, pandas also provides weighted windows (`rolling(..., win_type=...)`) and exponentially weighted windows (`ewm()`).
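
A minimal sketch of rolling, expanding, and time-based rolling windows on a hypothetical daily Series `s` (the data is assumed for illustration). For the time-based case the window is passed to `rolling()` as an offset string:

```python
import pandas as pd

# Hypothetical daily series used only for illustration.
s = pd.Series(
    [1, 2, 3, 4, 5],
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)

rolling_mean = s.rolling(window=3).mean()  # fixed-size window of 3 rows
expanding_sum = s.expanding().sum()        # all rows up to the current one
time_rolling_sum = s.rolling("2D").sum()   # rows within the last 2 days

print(rolling_mean)
print(expanding_sum)
print(time_rolling_sum)
```

The first two values of `rolling_mean` are `NaN` because a fixed-size window of 3 needs three rows before it produces a result, whereas the offset-based window produces a result from the first row.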

## Q7. Write a code to print only the current month and year at the time of answering this question.
[Hint: Use pandas.datetime function]

In [22]:
import pandas as pd

# pd.Timestamp.now() returns the current local date and time.
now = pd.Timestamp.now()
print('current year: ', now.year)
print('current month: ', now.strftime('%B'))

current year:  2023
current month:  July


## Q8. Write a Python program that takes in two dates as input (in the format YYYY-MM-DD) and calculates the difference between them in days, hours, and minutes using Pandas time delta. The program should prompt the user to enter the dates and display the result.

In [27]:
user_date1=pd.to_datetime(input("Enter date 1: "))
user_date2=pd.to_datetime(input("Enter date 2: "))
time_diff=user_date2-user_date1

print("Difference between the two dates:")
print(f"Days: {time_diff.days}")
print(f"Hours: {time_diff.seconds // 3600}")
print(f"Minutes: {(time_diff.seconds % 3600) // 60}")


Enter date 1: 2023-11-01
Enter date 2: 2023-11-05
Difference between the two dates:
Days: 4
Hours: 0
Minutes: 0
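
When the inputs also carry a time of day, the same breakdown can be read from `Timedelta.components`. A minimal sketch, with hypothetical timestamps chosen so that the hours and minutes are non-zero:

```python
import pandas as pd

# Hypothetical timestamps with a time of day, so hours and minutes are non-zero.
start = pd.Timestamp("2023-11-01 08:00")
end = pd.Timestamp("2023-11-05 13:30")
diff = end - start

parts = diff.components  # named tuple with days, hours, minutes, seconds, ...
print(f"Days: {parts.days}, Hours: {parts.hours}, Minutes: {parts.minutes}")

# Total elapsed time expressed in a single unit.
total_hours = diff.total_seconds() / 3600
print(total_hours)
```

`components` splits the difference into days plus leftover hours and minutes, while `total_seconds()` gives the whole span in one unit.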


## Q9. Write a Python program that reads a CSV file containing categorical data and converts a specified column to a categorical data type. The program should prompt the user to enter the file path, column name, and category order, and then display the sorted data.

In [None]:
file_path = input("Enter the file path (CSV file): ")
df = pd.read_csv(file_path)

column_name = input("Enter the column name: ")
category_order = input("Enter the category order (comma-separated values): ")

# Strip stray spaces so "low, medium, high" matches the values in the file.
categories = [c.strip() for c in category_order.split(',')]

# ordered=True makes the given order the sort order; values not listed become NaN.
df[column_name] = pd.Categorical(df[column_name], categories=categories, ordered=True)

sorted_df = df.sort_values(by=column_name)

print("Sorted Data:")
print(sorted_df)
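
A minimal sketch of the same idea without user input, using a hypothetical in-memory frame `demo` in place of the CSV file (the column names and values are assumptions):

```python
import pandas as pd

# Hypothetical data standing in for the CSV file.
demo = pd.DataFrame({"size": ["M", "S", "L", "S"], "price": [20, 10, 30, 12]})

# Sort by the custom order S < M < L rather than alphabetically.
demo["size"] = pd.Categorical(demo["size"], categories=["S", "M", "L"], ordered=True)
sorted_demo = demo.sort_values(by="size", kind="stable")

print(sorted_demo)
```

A plain string sort would put "L" first; the categorical sort follows the order given in `categories`.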


## Q10. Write a Python program that reads a CSV file containing sales data for different products and visualizes the data using a stacked bar chart to show the sales of each product category over time. The program should prompt the user to enter the file path and display the chart.

In [None]:
import matplotlib.pyplot as plt

file_path = input("Enter the file path (CSV file): ")

df = pd.read_csv(file_path)

df['Date'] = pd.to_datetime(df['Date'])

# One row per date and one column per product category, so the bars can be stacked.
grouped_data = df.groupby(['Date', 'Product Category'])['Sales'].sum().unstack(fill_value=0)

# Let pandas create the figure; a separate plt.figure() call would leave an empty one.
ax = grouped_data.plot(kind='bar', stacked=True, figsize=(10, 6))
ax.set_xlabel('Date')
ax.set_ylabel('Sales')
ax.set_title('Sales of Each Product Category Over Time')
ax.legend(loc='upper left')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
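
A stacked bar chart needs the data in wide form, with one row per date and one column per category. A minimal sketch of that reshaping on hypothetical sales records (the values are assumptions; the column names mirror those used in the program above):

```python
import pandas as pd

# Hypothetical sales records; column names mirror those assumed in the program.
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-02-01", "2023-02-01"]),
    "Product Category": ["Books", "Toys", "Books", "Books"],
    "Sales": [100, 50, 70, 30],
})

# Sum per (date, category), then move the categories into columns.
wide = sales.groupby(["Date", "Product Category"])["Sales"].sum().unstack(fill_value=0)
print(wide)
```

Each column of `wide` becomes one segment of the stacked bars, and `fill_value=0` covers dates on which a category had no sales.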


## Q11. You are given a CSV file containing student data that includes the student ID and their test score. Write a Python program that reads the CSV file, calculates the mean, median, and mode of the test scores, and displays the results in a table.

In [None]:
df = pd.read_csv(input("Enter path of csv file: "))

scores = df['Test Score']
mean = scores.mean()
median = scores.median()
# mode() returns a Series because several values can tie for most frequent.
mode = ', '.join(str(m) for m in scores.mode())

print('+-----------+----------------+')
print('| Statistic | Value          |')
print('+-----------+----------------+')
print(f'| {"Mean":<9} | {mean:<14.2f} |')
print(f'| {"Median":<9} | {median:<14.2f} |')
print(f'| {"Mode":<9} | {mode:<14} |')
print('+-----------+----------------+')