In [2]:
import pandas as pd
df=pd.DataFrame({'Course_name':['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer'],
                'Duration':[2,3,6,4]
                })
print(df)

        Course_name  Duration
0      Data Science         2
1  Machine Learning         3
2          Big Data         6
3     Data Engineer         4


In [5]:
second_row_data = df.iloc[1]
print(second_row_data)

Course_name    Machine Learning
Duration                      3
Name: 1, dtype: object


To print the data present in the second row of the DataFrame "df", you can use the ".iloc" attribute. In Python, indices start from 0, so to access the second row, you would use index 1.

In pandas, loc and iloc are indexers (they are used with square brackets rather than called like functions) that access and retrieve data from a DataFrame, but they differ in how they select it:

1. 'loc': The loc indexer is label-based. It accesses data using the labels of rows and columns. You can pass row and column labels to loc to select specific rows and columns, or use slices or boolean conditions to filter the data. The syntax for loc is df.loc[row_label, column_label] or df.loc[row_label, column_label_list]. A label slice includes its last element.

Example usage of 'loc':
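
A minimal sketch of label-based access, reusing the small course DataFrame from the first cell. The variable names (single_value, label_slice, long_courses) are chosen here only for illustration.

```python
import pandas as pd

# Same small DataFrame as in the first cell
df = pd.DataFrame({
    'Course_name': ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer'],
    'Duration': [2, 3, 6, 4],
})

# Single cell selected by row label and column label
single_value = df.loc[1, 'Course_name']

# Label slice: both endpoints (labels 1 and 2) are included
label_slice = df.loc[1:2, ['Course_name', 'Duration']]

# Boolean condition: names of the courses longer than 3
long_courses = df.loc[df['Duration'] > 3, 'Course_name']

print(single_value)
print(label_slice)
print(long_courses)
```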

2. 'iloc': The iloc indexer is based on integer position. It accesses data using the integer positions of rows and columns, counting from 0 and ignoring the labels. You can pass integer positions or slices to iloc to select specific rows and columns. The syntax for iloc is df.iloc[row_index, column_index] or df.iloc[row_index, column_index_list]. A position slice excludes its last element.

Example usage of iloc:
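
A minimal sketch of position-based access on the same course DataFrame. The variable names are again only illustrative.

```python
import pandas as pd

# Same small DataFrame as in the first cell
df = pd.DataFrame({
    'Course_name': ['Data Science', 'Machine Learning', 'Big Data', 'Data Engineer'],
    'Duration': [2, 3, 6, 4],
})

# Single cell selected by row position and column position
single_value = df.iloc[1, 0]

# Position slice: rows at positions 1 and 2 (position 3 is excluded)
position_slice = df.iloc[1:3, [0, 1]]

# Negative positions count from the end, as with Python lists
last_row = df.iloc[-1]

print(single_value)
print(position_slice)
print(last_row)
```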

The main difference between loc and iloc is that loc uses label-based indexing, whereas iloc uses integer-based indexing. Therefore, loc is used when you want to access data by labels, while iloc is used when you want to access data by integer indices.

To reindex the given DataFrame df1 using the variable reindex = [3, 0, 1, 2] and store it in the variable new_df, you can use the reindex function in pandas. Then you can retrieve the output for both new_df.loc[2] and new_df.iloc[2] to observe any differences.

Here's the code to achieve that:

In [6]:
import pandas as pd
import numpy as np

columns = ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']
indices = [1, 2, 3, 4, 5, 6]
# Creating a DataFrame
df1 = pd.DataFrame(np.random.rand(6, 6), columns=columns, index=indices)

# Reindexing the DataFrame
reindex = [3, 0, 1, 2]
new_df = df1.reindex(reindex)

# Accessing the output for new_df.loc[2]
output_loc = new_df.loc[2]
print("Output for new_df.loc[2]:")
print(output_loc)

# Accessing the output for new_df.iloc[2]
output_iloc = new_df.iloc[2]
print("\nOutput for new_df.iloc[2]:")
print(output_iloc)


Output for new_df.loc[2]:
column_1    0.998522
column_2    0.421391
column_3    0.644624
column_4    0.955271
column_5    0.436705
column_6    0.140970
Name: 2, dtype: float64

Output for new_df.iloc[2]:
column_1    0.970464
column_2    0.470229
column_3    0.743775
column_4    0.674579
column_5    0.248686
column_6    0.854871
Name: 1, dtype: float64


Observations:

The outputs of new_df.loc[2] and new_df.iloc[2] are different.

** new_df.loc[2] returns the row whose index label is 2. Reindexing keeps each row's data attached to its label, so this is the same row that carries label 2 in the original DataFrame (df1).

** new_df.iloc[2] returns the row at integer position 2 (the third row) of new_df. It ignores the index labels and looks only at the order of the rows.

After reindexing df1 with [3, 0, 1, 2], the rows of new_df appear in the order 3, 0, 1, 2. The row labelled 2 therefore sits at position 3, while position 2 holds the row labelled 1, which is why the second output ends with "Name: 1". Label 0 does not exist in df1, so reindex fills that row of new_df with NaN.

Therefore, the difference in output arises due to the distinction between label-based indexing (loc) and integer-based indexing (iloc).
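
Because df1 holds random numbers, the effect can be easier to follow on fixed values. The sketch below uses a hypothetical one-column DataFrame (small) with labels 1 to 4 and applies the same reindex list.

```python
import pandas as pd

# Hypothetical fixed data so the result is reproducible
small = pd.DataFrame({'value': [10.0, 20.0, 30.0, 40.0]}, index=[1, 2, 3, 4])

# Same reindex list as above; label 0 is absent from 'small'
reordered = small.reindex([3, 0, 1, 2])
print(reordered)

# Label lookup: the row labelled 2, wherever it sits
by_label = reordered.loc[2, 'value']

# Position lookup: the third row, which carries label 1
by_position = reordered.iloc[2, 0]

# The row for the missing label 0 is filled with NaN
missing = reordered.loc[0, 'value']

print(by_label, by_position, missing)
```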

In [7]:
column_means=df1.mean()
print('Mean of each column:')
print(column_means)

column_2_std=df1['column_2'].std()
print('\nStandard deviation of column:')
print(column_2_std)



Mean of each column:
column_1    0.683559
column_2    0.422462
column_3    0.608811
column_4    0.572995
column_5    0.379082
column_6    0.603310
dtype: float64

Standard deviation of column:
0.09413747469298855


In [9]:
# Trying to replace the data in the second row of 'column_2' with a string
df1.loc[2, 'column_2'] = 'string_value'

# Calculating the mean of 'column_2'
column_2_mean = df1['column_2'].mean()
print("Mean of 'column_2':", column_2_mean)

TypeError: unsupported operand type(s) for +: 'float' and 'str'

The error arises because assigning 'string_value' turns 'column_2' into a column of mixed types (dtype object). When mean() sums the column it has to add floats and a string, which Python does not support, hence the TypeError.

To avoid this error, you should replace the data with a valid numeric value. If you want to represent a missing or non-numeric value, you can use NaN (Not a Number), which is a special value in pandas to indicate missing or undefined data.

Here's an example code snippet that replaces the data in the second row of 'column_2' with NaN and calculates the mean:

In [10]:
# Replacing the data in the second row of 'column_2' with NaN
df1.loc[2, 'column_2'] = np.nan

# Calculating the mean of 'column_2' (excluding NaN values)
column_2_mean = df1['column_2'].mean()
print("Mean of 'column_2':", column_2_mean)


Mean of 'column_2': 0.4226765411053227


In pandas, the window functions provide a way to perform calculations on a specific window or subset of data within a DataFrame. These functions are especially useful for tasks such as rolling calculations, cumulative calculations, and various statistical analyses.

The window functions in pandas can be accessed through the rolling(), expanding(), and ewm() methods. Here's a brief overview of each type of window function:

1.Rolling Windows:

a.The rolling() function creates a window of a fixed size and performs calculations on that window as it moves along the data.

b.Common calculations include moving averages, rolling sums, and standard deviations.

c.Rolling windows can be applied to both DataFrame and Series objects.

d.Example: df.rolling(window=3).mean()

2.Expanding Windows:

a.The expanding() function creates a window that expands over time, starting from the beginning of the data and incorporating more observations as it progresses.

b.It calculates statistics on the entire expanding window, including all previous observations.

c.Common calculations include cumulative sums, running means, and running minimum or maximum values.

d.Expanding windows can be applied to both DataFrame and Series objects.

e.Example: df.expanding().sum()

3.Exponentially Weighted Windows:

a.The ewm() function calculates exponentially weighted statistics. Instead of a fixed window size, the rate of decay is set through a parameter such as span, com, halflife, or alpha.

b.It assigns weights to observations in the window based on an exponential decay factor, giving more weight to recent observations.

c.Common calculations include exponentially weighted moving averages and exponentially weighted standard deviations.

d.Exponentially weighted windows can be applied to both DataFrame and Series objects.

e.Example: df.ewm(span=3).mean()

These window functions offer flexibility in performing various calculations on subsets of data within a DataFrame, allowing for efficient analysis and exploration of time series and other ordered data.

It's important to note that when using window functions, it's crucial to consider the window size, which determines the number of data points included in each calculation. The appropriate window size depends on the specific analysis and the characteristics of the data being analyzed.
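
The three kinds of window can be compared side by side on a short Series. This is a minimal sketch with made-up values; the window size of 3 and the span of 3 are arbitrary choices.

```python
import pandas as pd

# Made-up ordered data
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling: mean of the current value and the two before it
# (the first two results are NaN because the window is not yet full)
rolling_mean = s.rolling(window=3).mean()

# Expanding: sum of everything seen so far
expanding_sum = s.expanding().sum()

# Exponentially weighted: recent values count more than older ones
ewm_mean = s.ewm(span=3).mean()

print(pd.DataFrame({
    'value': s,
    'rolling_mean': rolling_mean,
    'expanding_sum': expanding_sum,
    'ewm_mean': ewm_mean,
}))
```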

To print the current month and year, you can use pd.Timestamp.now(), which returns the current local date and time as a pandas Timestamp. A Timestamp exposes the same attributes as Python's datetime.datetime, including month and year.

Here's the code to print the current month and year:

In [6]:
import pandas as pd
# pd.Timestamp.now() is used instead of the removed pd.datetime.now()
current_date=pd.Timestamp.now()
current_month=current_date.month
current_year=current_date.year
# Print the current month and year
print("Current Month:", current_month)
print("Current Year:", current_year)

Current Month: 7
Current Year: 2023


The pd.Timestamp.now() function retrieves the current date and time. Then, we extract the month and year using the month and year attributes of the Timestamp object.

Note that older code often calls pd.datetime.now(). pd.datetime was only an alias for Python's datetime.datetime class; it was deprecated in pandas 1.0 and removed in pandas 2.0, so that call emits a FutureWarning on older versions and fails on current ones. Use pd.Timestamp.now(), or datetime.now() from the standard library's datetime module.

In [9]:
import pandas as pd
date1=input("Enter the first date (YYYY-MM-DD):")
date2=input("Enter the Second date (YYYY-MM-DD):")

# Convert the input strings to pandas Timestamp objects
timestamp1 = pd.Timestamp(date1)
timestamp2 = pd.Timestamp(date2)

# Calculate the difference between the two dates using pandas Timedelta
time_difference=timestamp2-timestamp1

# Extract the days, hours, and minutes from the time difference
# (.seconds holds only the part of the difference that is shorter than one day)
days=time_difference.days
hours=time_difference.seconds//3600
minutes=(time_difference.seconds%3600)//60

# Display the result
print("Difference between the two dates:")
print("Days:", days)
print("Hours:", hours)
print("Minutes:", minutes)

Enter the first date (YYYY-MM-DD): 2023-07-04
Enter the Second date (YYYY-MM-DD): 2023-07-10


Difference between the two dates:
Days: 6
Hours: 0
Minutes: 0


In this program, the user is prompted to enter two dates in the format YYYY-MM-DD. The input strings are converted to pandas Timestamp objects using pd.Timestamp(). Then, the difference between the two dates is calculated using timestamp2 - timestamp1, resulting in a pandas Timedelta object.

The Timedelta object stores the difference as days, seconds, and microseconds. We read the number of whole days from the days attribute. The seconds attribute holds only the part of the difference that is shorter than one day, so the hours are obtained by integer-dividing it by 3600, and the minutes are what is left of it, in units of 60 seconds, once the whole hours are removed.

Finally, the program displays the difference between the two dates in terms of days, hours, and minutes.
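
As an alternative to the manual arithmetic, a Timedelta exposes a components attribute that already splits the difference into days, hours, and minutes. The sketch below uses two hard-coded timestamps (start and end, chosen only for illustration) instead of user input.

```python
import pandas as pd

# Hard-coded example timestamps, including a time of day
start = pd.Timestamp('2023-07-04 08:30')
end = pd.Timestamp('2023-07-10 14:45')

delta = end - start

# components splits the Timedelta into named fields
parts = delta.components
print("Days:", parts.days)
print("Hours:", parts.hours)
print("Minutes:", parts.minutes)

# total_seconds() gives the whole difference as one number
total_hours = delta.total_seconds() / 3600
print("Total hours:", total_hours)
```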

In [22]:
import pandas as pd

# Prompt the user to enter the file path, column name, and category order
file_path = input("Enter the file path of the CSV file: ")
column_name = input("Enter the name of the column to convert: ")
category_order = input("Enter the category order (comma-separated): ")

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Convert the specified column to a categorical data type with the specified category order
# (strip spaces so that input such as "low, medium, high" still matches the data)
category_order_list = [item.strip() for item in category_order.split(',')]
df[column_name] = pd.Categorical(df[column_name], categories=category_order_list, ordered=True)

# Sort the DataFrame based on the specified column
sorted_df = df.sort_values(column_name)

# Display the sorted data
print("Sorted Data:")
print(sorted_df)
sorted_df.head()


Enter the file path of the CSV file:  players_data.csv
Enter the name of the column to convert:  Player
Enter the category order (comma-separated):  Age


Sorted Data:
      Rk Player Pos Age   Tm   G  GS    MP   FG  FGA  ...   FT%  ORB  DRB  \
0      1    NaN  PF  24  NYK  68  22  1287  152  331  ...  .784   79  222   
1      2    NaN  SG  20  MEM  30   0   248   35   86  ...  .609    9   19   
2      3    NaN   C  21  OKC  70  67  1771  217  399  ...  .502  199  324   
3      4    NaN  PF  28  MIN  17   0   215   19   44  ...  .579   23   54   
4      5    NaN  SG  29  TOT  78  72  2502  375  884  ...  .843   27  220   
..   ...    ...  ..  ..  ...  ..  ..   ...  ...  ...  ...   ...  ...  ...   
670  490    NaN  PF  26  TOT  76  68  2434  451  968  ...  .655  127  284   
671  490    NaN  PF  26  MIN  48  48  1605  289  641  ...  .682   75  170   
672  490    NaN  PF  26  BRK  28  20   829  162  327  ...  .606   52  114   
673  491    NaN   C  22  CHO  62  45  1487  172  373  ...  .774   97  265   
674  492    NaN   C  25  BOS  82  59  1731  340  619  ...  .823  146  319   

     TRB  AST  STL BLK  TOV   PF   PTS  
0    301   68   27  2

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,,PF,24,NYK,68,22,1287,152,331,...,0.784,79,222,301,68,27,22,60,147,398
1,2,,SG,20,MEM,30,0,248,35,86,...,0.609,9,19,28,16,16,7,14,24,94
2,3,,C,21,OKC,70,67,1771,217,399,...,0.502,199,324,523,66,38,86,99,222,537
3,4,,PF,28,MIN,17,0,215,19,44,...,0.579,23,54,77,15,4,9,9,30,60
4,5,,SG,29,TOT,78,72,2502,375,884,...,0.843,27,220,247,129,41,7,116,167,1035


In this program, the user is prompted to enter the file path of the CSV file, the name of the column to convert, and the category order (comma-separated). The CSV file is read into a DataFrame using pd.read_csv().

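Then, the specified column is converted to a categorical data type using pd.Categorical(). The categories parameter is set to the category_order_list, which is created by splitting the input category_order at the commas. The ordered parameter is set to True to indicate that the categories have a specific order. Any value in the column that does not appear in the category list becomes NaN. This is what happened in the sample run above: the only category entered was "Age", which matches none of the values in the Player column, so every Player entry is shown as NaN.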

After converting the column, the DataFrame is sorted based on the specified column using df.sort_values(). The sorted DataFrame is stored in the sorted_df variable.

Finally, the program displays the sorted data using print(sorted_df).
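
The same conversion can be tried without a CSV file. The sketch below uses a hypothetical table (sizes) and a hand-written category order; the value 'huge' is deliberately left out of the order to show how unlisted values are handled.

```python
import pandas as pd

# Hypothetical data; 'huge' is not part of the category order below
sizes = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'size': ['medium', 'small', 'large', 'huge'],
})

order = ['small', 'medium', 'large']

# Ordered categorical: sorting follows 'order', not the alphabet
sizes['size'] = pd.Categorical(sizes['size'], categories=order, ordered=True)

# Values outside the category list have become NaN and sort last
sorted_sizes = sizes.sort_values('size')
print(sorted_sizes)
```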

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Prompt the user to enter the file path of the CSV file
file_path = input("Enter the file path of the CSV file: ")

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Assuming the CSV file has columns 'Date', 'Product_Category', and 'Sales'
# If column names differ, modify the code accordingly

# Convert the 'Date' column to datetime data type
df['Date'] = pd.to_datetime(df['Date'])

# Group the data by 'Date' and 'Product_Category' and sum the 'Sales' for each category
sales_by_category = df.groupby(['Date', 'Product_Category'])['Sales'].sum().unstack()

# Create a stacked bar chart
ax = sales_by_category.plot(kind='bar', stacked=True, figsize=(10, 6))

# Set the title and labels for axes
plt.title("Sales by Product Category over Time")
plt.xlabel("Date")
plt.ylabel("Sales")

# Show the legend
plt.legend(title="Product Category", loc='upper left')

# Show the chart
plt.show()
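
Before pointing the cell above at a real file, the grouping step can be checked on a small made-up table. The sketch below assumes the same three column names ('Date', 'Product_Category', 'Sales') and invented figures; it prints the table that the stacked bar chart would be drawn from and leaves out the plotting itself.

```python
import pandas as pd

# Invented sales records with the column names assumed above
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-01',
                            '2023-01-02', '2023-01-02']),
    'Product_Category': ['Books', 'Toys', 'Books', 'Books', 'Toys'],
    'Sales': [100, 50, 25, 80, 40],
})

# One row per date, one column per category, summed sales in the cells
sales_by_category = (
    sales.groupby(['Date', 'Product_Category'])['Sales'].sum().unstack()
)
print(sales_by_category)
```

Each row of this table becomes one bar and each column becomes one coloured segment of that bar.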


To calculate the mean, median, and mode of test scores from a CSV file containing student data and display the results in a table, you can use the pandas library in Python. Here's a Python program that accomplishes this:

In [26]:
import pandas as pd
df=pd.DataFrame({'Student ID':(1,2,3,4,5,6,7,8,9,10),
                          'Test Score':(85,90,80,75,85,82,78,85,90,85)})
print(df)

   Student ID  Test Score
0           1          85
1           2          90
2           3          80
3           4          75
4           5          85
5           6          82
6           7          78
7           8          85
8           9          90
9          10          85


In [27]:
import pandas as pd

# Assuming you have a DataFrame called df

# Specify the file path and name for the output CSV file
output_file = 'df.csv'

# Convert the DataFrame to a CSV file
df.to_csv(output_file, index=False)


In [28]:
import pandas as pd

# Prompt the user to enter the file path of the CSV file
file_path = input("Enter the file path of the CSV file containing the student data: ")

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Calculate the mean, median, and mode of the test scores
mean_score = df['Test Score'].mean()
median_score = df['Test Score'].median()
mode_scores = df['Test Score'].mode()

# Create a dictionary to hold the results
results = {
    'Statistic': ['Mean', 'Median', 'Mode'],
    'Value': [mean_score, median_score, ', '.join(map(str, mode_scores))]
}

# Create a DataFrame from the results dictionary
results_df = pd.DataFrame(results)

# Display the results table
print(results_df)


Enter the file path of the CSV file containing the student data:  df.csv


  Statistic Value
0      Mean  83.5
1    Median  85.0
2      Mode    85
