# Consider the following code to answer the questions below:

In [21]:
import pandas as pd

course_name = [ 'Data Science', 'Machine Learning', 'Big Data', 'Data Engineer']

duration =  [2,3,6,4]

df = pd.DataFrame(data = {'course_name' : course_name, 'duration' : duration})

# Q1. Write a code to print the data present in the second row of the dataframe, df.

In [22]:
print(df.iloc[1])

course_name    Machine Learning
duration                      3
Name: 1, dtype: object


# Q2. What is the difference between the functions loc and iloc in pandas.DataFrame?

In pandas, `loc` and `iloc` are two indexers (used with square brackets rather than called like functions) for accessing data in a DataFrame.

1. **`loc`**:
   - `loc` is label-based indexing: it selects data by the actual row and column labels (names).
   - The syntax is `df.loc[row_label, column_label]`.
   - Slicing with `loc` is inclusive of both the start and stop indices.
   - Example: `df.loc[1, 'column_name']` selects the value in the row labeled `1` and the column labeled `'column_name'`.

2. **`iloc`**:
   - `iloc` is integer-location based indexing: it selects data by zero-based integer position, regardless of the labels.
   - The syntax is `df.iloc[row_index, column_index]`.
   - Slicing with `iloc` is exclusive of the stop index (standard Python behavior).
   - Example: `df.iloc[1, 0]` selects the value in the second row and the first column (using zero-based indexing).

In [23]:
print(df.loc[ 1 , 'course_name'])

print(df.iloc[1, 0])

Machine Learning
Machine Learning
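The slicing difference described above can be sketched as follows, reusing the same course data. The names `label_slice` and `position_slice` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'course_name': ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer'],
    'duration': [2, 3, 6, 4],
})

# loc slices by label and includes the stop label: rows labeled 1 and 2
label_slice = df.loc[1:2]

# iloc slices by position and excludes the stop position: only position 1
position_slice = df.iloc[1:2]

print(len(label_slice), len(position_slice))
```

With the default integer index the labels and positions coincide, so the only visible difference here is whether the stop value is included.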


# Q3. Reindex the given dataframe using a variable, reindex = [3,0,1,2] and store it in the variable, new_df then find the output for both new_df.loc[2] and new_df.iloc[2].

Did you observe any difference in both the outputs? If so then explain it.

In [33]:
reindex = [3, 0, 1, 2]

new_df = df.reindex(reindex)

print(new_df.iloc[2],"\n\n",new_df.loc[2] )

course_name    Machine Learning
duration                      3
Name: 1, dtype: object 

 course_name    Big Data
duration              6
Name: 2, dtype: object


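Yes, the outputs differ:

- `new_df.loc[2]` selects by label, so it returns the row labeled `2` (Big Data, 6).
- `new_df.iloc[2]` selects by position, so it returns the third row of the reordered frame, which carries the label `1` (Machine Learning, 3).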
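The same comparison can be sketched with `DataFrame.reindex`, which reorders rows by label. This is one possible way to build `new_df`, assuming the course data defined earlier; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'course_name': ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer'],
    'duration': [2, 3, 6, 4],
})

reindex = [3, 0, 1, 2]
new_df = df.reindex(reindex)   # rows now appear in label order 3, 0, 1, 2

by_label = new_df.loc[2]       # the row whose label is 2
by_position = new_df.iloc[2]   # the third row in the new order (label 1)

print(by_label['course_name'])
print(by_position['course_name'])
```

Each label travels with its row when the frame is reordered, while positions are always counted from zero in the current order.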

# Consider the below code to answer further questions:

In [47]:
import pandas as pd

import numpy as np

columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']

indices = [1,2,3,4,5,6]

df1 = pd.DataFrame(np.random.rand(6,6), columns = columns, index = indices)

# Q4. Write a code to find the following statistical measurements for the above dataframe df1:

# (i) mean of each and every column present in the dataframe.
# (ii) standard deviation of column, ‘column_2’

In [48]:
# (i) Mean
column_means = df1.mean()
print("Mean of each column:")
print(column_means)

# (ii) Standard deviation
std_column_2 = df1['column_2'].std()
print("\nStandard deviation of column 'column_2':", std_column_2)


Mean of each column:
column_1    0.361776
column_2    0.530504
column_3    0.664343
column_4    0.334440
column_5    0.628932
column_6    0.321354
dtype: float64

Standard deviation of column 'column_2': 0.2747831742196543


# Q5. Replace the data present in the second row of column, ‘column_2’ by a string variable then find the mean of column, column_2.

If you are getting errors in executing it then explain why.

[Hint: To replace the data use df1.loc[] and equate this to string data of your choice.]

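'column_2' is initially numeric (float64) because it was created with `np.random.rand`. Assigning a string to one of its cells forces the column to the generic `object` dtype; recent pandas versions emit a warning for this upcast, and newer releases may reject the assignment outright. pandas does allow mixed types in an `object` column, but numeric reductions such as `mean()` then fail with a `TypeError`, because a string cannot be added to floats. The cell below works around this with `pd.to_numeric(..., errors='coerce')`, which turns the string into `NaN`. Since `mean()` skips `NaN` by default, the result is the mean of the remaining five values.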

In [49]:
df1.loc[2, 'column_2'] = 'string_value'

df1['column_2'] = pd.to_numeric(df1['column_2'], errors='coerce')

mean_column_2 = df1['column_2'].mean()
print("Mean of column 'column_2':", mean_column_2)

Mean of column 'column_2': 0.5815411061078929
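A minimal sketch of the failure and the workaround, using illustrative values rather than the random data above. The Series is created with `object` dtype up front so the string can be stored without triggering the upcast warning.

```python
import pandas as pd

# Illustrative values (an assumption, not the random data used above)
column_2 = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                     index=[1, 2, 3, 4, 5, 6], dtype=object)
column_2.loc[2] = 'string_value'

try:
    column_2.mean()
    mean_failed = False
except (TypeError, ValueError) as exc:
    # A string cannot be summed together with floats
    mean_failed = True
    print("mean() failed:", type(exc).__name__)

# Coerce non-numeric entries to NaN; mean() skips NaN by default
cleaned = pd.to_numeric(column_2, errors='coerce')
print(cleaned.mean())
```

Coercing silently discards the replaced value, so it suits cases where the string really is bad data rather than information to keep.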




# Q6. What do you understand about the windows function in pandas and list the types of windows functions?

In pandas, a window function performs a calculation over a window of data points (a subset of consecutive rows) that moves or grows as it passes through the dataset. Window functions are useful for computing statistics or applying custom functions to analyze trends and patterns over time.

pandas provides four types of windowing operations: rolling windows (`rolling`), weighted windows (`rolling` with a `win_type`), expanding windows (`expanding`), and exponentially weighted windows (`ewm`). Commonly used window calculations include:

1. **Rolling Mean (Moving Average):**
   - Calculates the mean of values in a rolling window.

2. **Rolling Sum:**
   - Calculates the sum of values in a rolling window.

3. **Rolling Standard Deviation:**
   - Calculates the standard deviation of values in a rolling window.

4. **Rolling Min and Max:**
   - Calculates the minimum and maximum values in a rolling window.

5. **Exponential Moving Average (EMA):**
   - Calculates the exponentially weighted moving average, available through `ewm` rather than `rolling`.

6. **Custom Window Functions:**
   - You can also define and apply your custom functions using the `apply` method on the rolling window.
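The calculations listed above can be sketched on a small Series. The data, the window size of 3, and the `span=3` setting are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

rolling_mean = s.rolling(window=3).mean()   # mean over the last 3 values
rolling_sum = s.rolling(window=3).sum()     # sum over the last 3 values
rolling_std = s.rolling(window=3).std()     # sample std over the last 3 values
rolling_max = s.rolling(window=3).max()     # max over the last 3 values

expanding_mean = s.expanding().mean()       # mean of everything seen so far

# Exponentially weighted mean; span=3 corresponds to alpha = 0.5
ewm_mean = s.ewm(span=3, adjust=False).mean()

# Custom function: range (max - min) within each window
rolling_range = s.rolling(window=3).apply(lambda w: w.max() - w.min())

print(pd.DataFrame({
    'rolling_mean': rolling_mean,
    'expanding_mean': expanding_mean,
    'ewm_mean': ewm_mean,
    'rolling_range': rolling_range,
}))
```

The first two rolling results are `NaN` because a full window of three values is not yet available. Passing `min_periods=1` to `rolling` would relax that requirement.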

# Q7. Write a code to print only the current month and year at the time of answering this question.

[Hint: Use pandas.datetime function]

In [51]:
from datetime import datetime

current_month = pd.to_datetime(datetime.now()).strftime('%B %Y')
print(current_month)

November 2023
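The `pandas.datetime` alias mentioned in the hint was deprecated in pandas 1.0 and is no longer available in pandas 2.x, which is why the cell above falls back on the standard `datetime` module. A pandas-only alternative, sketched here with `pd.Timestamp.now()` (the variable name `current_month_year` is illustrative):

```python
import pandas as pd

now = pd.Timestamp.now()

# month_name() returns the English month name by default
current_month_year = f"{now.month_name()} {now.year}"
print(current_month_year)
```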


# Q8. Write a Python program that takes in two dates as input (in the format YYYY-MM-DD) and calculates the difference between them in days, hours, and minutes using Pandas time delta. The program should prompt the user to enter the dates and display the result.

In [1]:
import pandas as pd

def calculate(start_date, end_date):
    start_datetime = pd.to_datetime(start_date)
    end_datetime = pd.to_datetime(end_date)

    difference = end_datetime - start_datetime
    days = difference.days
    hours, remainder = divmod(difference.seconds, 3600)
    minutes, _ = divmod(remainder, 60)

    return days, hours, minutes

start_date = input("Enter the start date (YYYY-MM-DD): ")
end_date = input("Enter the end date (YYYY-MM-DD): ")

days, hours, minutes = calculate(start_date, end_date)
print(f"\nTime difference between {start_date} and {end_date}:")
print(f"Days: {days} days")
print(f"Hours: {hours} hours")
print(f"Minutes: {minutes} minutes")


Time difference between  and 2000-02-01:
Days: nan days
Hours: nan hours
Minutes: nan minutes
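The `nan` values in the output above come from an empty start date, which pandas parses as `NaT` (not a time). A hedged variant that rejects such input and reads the parts from `Timedelta.components` might look like the sketch below. The function name `date_difference` and the fixed example dates are assumptions chosen so the block runs without prompting.

```python
import pandas as pd

def date_difference(start_date: str, end_date: str):
    """Return (days, hours, minutes) between two YYYY-MM-DD dates."""
    start = pd.to_datetime(start_date, format="%Y-%m-%d")
    end = pd.to_datetime(end_date, format="%Y-%m-%d")

    # An empty string can parse to NaT instead of raising, so check explicitly
    if pd.isna(start) or pd.isna(end):
        raise ValueError("Both dates must be non-empty and in YYYY-MM-DD format.")

    parts = (end - start).components
    return parts.days, parts.hours, parts.minutes

days, hours, minutes = date_difference("2000-01-01", "2000-02-01")
print(f"Days: {days}, Hours: {hours}, Minutes: {minutes}")
```

Because both inputs are plain dates without a time of day, the hour and minute parts are always zero. They only become informative if times are included in the input.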


# Q9. Write a Python program that reads a CSV file containing categorical data and converts a specified column to a categorical data type. The program should prompt the user to enter the file path, column name, and category order, and then display the sorted data

In [2]:
import pandas as pd

def convert(dataframe, column_name, category_order):
    dataframe[column_name] = pd.Categorical(dataframe[column_name], categories=category_order, ordered=True)

file_path = input("Enter the CSV file path: ")
column_name = input("Enter the column name to convert to categorical: ")
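# Strip spaces around each category so "low, high" matches the values "low" and "high"
order = [c.strip() for c in input("Enter the category order (comma-separated): ").split(',')]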

df = pd.read_csv(file_path)

convert(df, column_name, order)

sorted_df = df.sort_values(by=column_name)
print("\nSorted Data:")
print(sorted_df)



Sorted Data:
    id  location_id  program_id         accepted_payments  \
0    1            1         NaN                       NaN   
1    2            2         NaN                       NaN   
2    3            3         NaN                       NaN   
3    4            4         NaN                       NaN   
4    5            5         NaN                       NaN   
5    6            6         NaN                       NaN   
6    7            7         NaN                       NaN   
7    8            8         NaN                       NaN   
8    9            9         NaN                       NaN   
9   10           10         NaN                       NaN   
10  11           11         NaN                       NaN   
11  12           12         NaN                       NaN   
12  13           13         NaN                       NaN   
13  14           14         NaN                       NaN   
14  15           15         NaN                       NaN   
15  16    
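To show the effect of the category order without a CSV file, here is a small in-memory sketch. The column name `size` and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'size': ['medium', 'small', 'large', 'small']})

# Same parsing as for user input: split on commas and strip whitespace
order = [c.strip() for c in "small, medium, large".split(',')]

df['size'] = pd.Categorical(df['size'], categories=order, ordered=True)

# Sorting follows the declared category order, not alphabetical order
sorted_df = df.sort_values(by='size')
print(sorted_df)
```

Values that do not appear in the supplied category list become `NaN` after the conversion, so a typo in the order silently blanks out the matching rows.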

# Q10. Write a Python program that reads a CSV file containing sales data for different products and visualizes the data using a stacked bar chart to show the sales of each product category over time. The program should prompt the user to enter the file path and display the chart.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

path = input("Enter the CSV file path: ")

df = pd.read_csv(path)

required_columns = {'Date', 'Product_Category', 'Sales'}
if not required_columns.issubset(df.columns):
    raise SystemExit("The CSV file should contain columns 'Date', 'Product_Category', and 'Sales'.")

df['Date'] = pd.to_datetime(df['Date'])

# pivot_table sums repeated (Date, Product_Category) pairs; plain pivot would raise on them
pivot_df = df.pivot_table(index='Date', columns='Product_Category', values='Sales', aggfunc='sum')

# Let pandas create the figure so no empty extra figure is left behind
pivot_df.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Stacked Bar Chart of Sales by Product Category Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()
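The reshaping step is the part most likely to fail on real sales data, because a date and category pair may appear more than once. A minimal sketch with made-up rows, comparing `pivot` with `pivot_table` (plotting is left out so the block runs without a display):

```python
import pandas as pd

# Made-up sales rows; category 'A' appears twice on 2023-01-01
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02']),
    'Product_Category': ['A', 'A', 'B', 'A'],
    'Sales': [10, 5, 7, 3],
})

try:
    sales.pivot(index='Date', columns='Product_Category', values='Sales')
    pivot_failed = False
except ValueError:
    # pivot cannot reshape when an index/column pair is duplicated
    pivot_failed = True

# pivot_table aggregates the duplicates and fills missing combinations with 0
pivot_df = sales.pivot_table(index='Date', columns='Product_Category',
                             values='Sales', aggfunc='sum', fill_value=0)
print(pivot_df)
```

The resulting frame has one row per date and one column per category, which is the shape `plot(kind='bar', stacked=True)` expects.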


# Q11. You are given a CSV file containing student data that includes the student ID and their test score. Write a Python program that reads the CSV file, calculates the mean, median, and mode of the test scores, and displays the results in a table.

In [7]:
import pandas as pd

df = pd.read_csv(input("Enter the file path of the CSV : "))

mean = df['Test Score'].mean()
median = df['Test Score'].median()
mode = df['Test Score'].mode()

result_table = pd.DataFrame({
    'Statistic': ['Mean', 'Median', 'Mode'],
    'Value': [mean, median, ', '.join(map(str, mode))]
})

print(result_table)

   Student ID  Test Score
0           1          85
1           2          90
2           3          80
3           4          75
4           5          85
5           6          82
6           7          78
7           8          85
8           9          90
9          10          85
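The same statistics can be reproduced without a file by entering the ten test scores shown above directly. This is a sketch of the calculation only, with `scores` as an illustrative name.

```python
import pandas as pd

# Test scores copied from the table above
scores = pd.Series([85, 90, 80, 75, 85, 82, 78, 85, 90, 85], name='Test Score')

result_table = pd.DataFrame({
    'Statistic': ['Mean', 'Median', 'Mode'],
    # mode() returns a Series because several values can tie for most frequent
    'Value': [scores.mean(), scores.median(), ', '.join(map(str, scores.mode()))],
})

print(result_table)
```

For these scores the mean is 83.5, and both the median and the single mode are 85.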
