Q1. List any five functions of the pandas library with execution.

Pandas is a popular Python library for data manipulation and analysis. Here are five commonly used pandas functions and DataFrame methods, each with example code:

1. read_csv: Used to read data from a CSV file into a DataFrame.

import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('data.csv')

2. head: Displays the first few rows of a DataFrame to get a quick overview of the data.

# Display the first 5 rows of the DataFrame
df.head()

3. info: Prints a concise summary of the DataFrame, including column data types, non-null counts, and memory usage.

# Get information about the DataFrame
df.info()

4. describe: Generates summary statistics of numerical columns in the DataFrame.

# Generate summary statistics for numeric columns
df.describe()

5. groupby: Used for grouping data based on one or more columns and performing operations on those groups.

# Group data by a column and calculate the mean of another column within each group
group = df.groupby('Category')['Price'].mean()

These are just a few examples of the many functions provided by the pandas library for data manipulation and analysis in Python.
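
The snippets above assume a file named data.csv that has 'Category' and 'Price' columns. As a minimal self-contained sketch, the same five calls can be run against a small in-memory CSV. The sample rows and the names csv_text and mean_price are illustrative assumptions, not data from the original exercise.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for 'data.csv'
csv_text = """Category,Price
Fruit,2.0
Fruit,4.0
Dairy,3.0
Dairy,5.0
Bakery,6.0
"""

# 1. read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# 2. head: show the first three rows
print(df.head(3))

# 3. info: print column dtypes and non-null counts
df.info()

# 4. describe: summary statistics for the numeric 'Price' column
print(df.describe())

# 5. groupby: mean price within each category
mean_price = df.groupby('Category')['Price'].mean()
print(mean_price)
```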

Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.

You can re-index a Pandas DataFrame with a custom index starting from 1 and incrementing by 2 for each row using the set_index function. Here's a Python function to do that:

import pandas as pd

def reindex_with_custom_index(df):
    # Create a new index starting from 1 and incrementing by 2
    new_index = pd.Index(range(1, len(df) * 2, 2), name='CustomIndex')
    
    # Set the new index for the DataFrame
    df = df.set_index(new_index)
    
    return df

# Example usage:
data = {'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]}
df = pd.DataFrame(data)
df = reindex_with_custom_index(df)
print(df)

In this function, we first create a new custom index using pd.Index with a range starting from 1 and incrementing by 2. Then, we set this new index for the DataFrame using set_index. The resulting DataFrame will have the custom index as specified in the function.
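
An alternative worth considering is to assign the new labels directly to the index attribute. The sketch below is one possible variant, not the only way to do it; the function name reindex_with_range_index is hypothetical, and it works on a copy so the caller's DataFrame keeps its original index.

```python
import pandas as pd

def reindex_with_range_index(df):
    # Work on a copy so the caller's DataFrame is left unchanged
    result = df.copy()
    # One label per row: 1, 3, 5, ... (same range as the set_index version)
    result.index = pd.RangeIndex(start=1, stop=2 * len(result), step=2)
    return result

# Example usage:
original = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})
reindexed = reindex_with_range_index(original)
print(reindexed)
```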

Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.

For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should
calculate and print the sum of the first three values, which is 60.

You can compute this by selecting the first three entries of the 'Values' column with head(3) and summing them; the vectorized sum() takes the place of an explicit Python loop. Here's a Python function that does that:

import pandas as pd

def calculate_sum_of_first_three_values(df):
    # Extract the 'Values' column as a Pandas Series
    values_column = df['Values']
    
    # Calculate the sum of the first three values
    sum_of_first_three_values = values_column.head(3).sum()
    
    # Print the result to the console
    print("Sum of the first three values:", sum_of_first_three_values)

# Example usage:
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
calculate_sum_of_first_three_values(df)

In this function, we first extract the 'Values' column as a Pandas Series using df['Values']. Then, we use the head(3) method to select the first three values in the Series and calculate their sum using the sum() method. Finally, we print the result to the console.
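
Because the question asks for a function that iterates over the DataFrame, an explicit loop may be what is expected. The following is a minimal sketch using itertuples; the function name sum_first_three_by_iteration is hypothetical, and returning the total in addition to printing it is an added convenience rather than part of the task.

```python
import pandas as pd

def sum_first_three_by_iteration(df):
    total = 0
    # itertuples yields one namedtuple per row; enumerate tracks the row position
    for position, row in enumerate(df.itertuples(index=False)):
        if position >= 3:
            break  # stop after the first three rows
        total += row.Values
    print("Sum of the first three values:", total)
    return total

# Example usage:
example = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
result = sum_first_three_by_iteration(example)
```

If the DataFrame has fewer than three rows, the loop simply sums the rows that exist.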

Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.

You can create a new column 'Word_Count' in a Pandas DataFrame based on the number of words in each row of the 'Text' column using the apply function along with a custom function that counts words. Here's a Python function to achieve this:

import pandas as pd

def count_words(text):
    # Split the text into words using whitespace as the separator and count the words
    words = text.split()
    return len(words)

def add_word_count_column(df):
    # Apply the count_words function to each row in the 'Text' column and create a new 'Word_Count' column
    df['Word_Count'] = df['Text'].apply(count_words)
    return df

# Example usage:
data = {'Text': ["This is a sample text", "Hello, world!", "Python programming is fun"]}
df = pd.DataFrame(data)
df = add_word_count_column(df)
print(df)

In this code, we first define a count_words function that splits the input text into words using whitespace as the separator and returns the count of words. Then, we use the apply function to apply this function to each row in the 'Text' column of the DataFrame and create a new 'Word_Count' column.
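
The apply-based version raises an error if a row of 'Text' is missing, since a missing value has no split method. A possible alternative is the vectorized .str accessor, which propagates missing values instead. In the sketch below, the function name add_word_count_vectorized is hypothetical, and counting missing text as zero words is an assumption, not something the exercise specifies.

```python
import pandas as pd

def add_word_count_vectorized(df):
    # Work on a copy so the caller's DataFrame is left unchanged
    result = df.copy()
    # str.split() with no arguments splits on whitespace; str.len() counts the pieces
    counts = result['Text'].str.split().str.len()
    # Missing text produces NaN here; treat it as zero words (an assumption)
    result['Word_Count'] = counts.fillna(0).astype(int)
    return result

# Example usage:
texts = pd.DataFrame({'Text': ["This is a sample text", "Hello, world!", None]})
counted = add_word_count_vectorized(texts)
print(counted)
```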

Q5. How are DataFrame.size() and DataFrame.shape() different?

In Pandas, DataFrame.size and DataFrame.shape are both attributes (properties), so they are accessed without parentheses. Both describe the dimensions of a DataFrame, but they provide different information:

1. DataFrame.size:

DataFrame.size returns the total number of elements in the DataFrame, which is equivalent to the product of the number of rows and the number of columns.
It represents the total number of data points or cells in the DataFrame, including all rows and columns.
It returns a single integer value.
Example:

import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

size = df.size  # Returns 6 (2 columns * 3 rows = 6 elements)

2. DataFrame.shape:

DataFrame.shape returns a tuple representing the dimensions of the DataFrame, where the first element of the tuple is the number of rows, and the second element is the number of columns.
It provides a more detailed breakdown of the DataFrame's structure, specifying the number of rows and columns separately.
It returns a tuple of two integers.
Example:

import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
shape = df.shape  # Returns (3, 2) indicating 3 rows and 2 columns

In summary, DataFrame.size gives you the total number of elements in the DataFrame, while DataFrame.shape provides the number of rows and columns as a tuple. Depending on your needs, you may use one or the other to obtain the desired information about the DataFrame's dimensions.
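
As a small illustration of the relationship between the two, the sketch below checks that size equals rows times columns and shows what happens if size is called like a method. The variable names rows, cols, and called_ok are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# shape unpacks into (rows, columns); size is their product
rows, cols = df.shape
product = rows * cols

# size is an attribute holding an integer, so calling it raises TypeError
try:
    df.size()
    called_ok = True
except TypeError:
    called_ok = False

print("shape:", df.shape, "size:", df.size, "callable:", called_ok)
```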

Q6. Which function of pandas do we use to read an excel file?

In Pandas, you can use the read_excel function to read data from an Excel file. This function allows you to read data from Excel spreadsheets and create a Pandas DataFrame. Here's how you can use it:

import pandas as pd

# Read data from an Excel file into a DataFrame
df = pd.read_excel('example.xlsx')  # Replace 'example.xlsx' with the name of your Excel file

The read_excel function provides parameters such as sheet_name, skiprows, and usecols that let you choose which sheet to read, skip leading rows, and select specific columns. Reading .xlsx files also requires an Excel engine such as openpyxl to be installed.

Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.

The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.

You can create a new 'Username' column in a Pandas DataFrame by applying a custom function that extracts the username from each email address. Here's a Python function to achieve this:

import pandas as pd

def extract_username(email):
    # Split the email address using '@' and return the first part (the username)
    return email.split('@')[0]

def add_username_column(df):
    # Apply the extract_username function to each row in the 'Email' column and create a new 'Username' column
    df['Username'] = df['Email'].apply(extract_username)
    return df

# Example usage:
data = {'Email': ['john.doe@example.com', 'jane.smith@example.com', 'bob@example.com']}
df = pd.DataFrame(data)
df = add_username_column(df)
print(df)

In this code, we define the extract_username function, which splits the input email address using the '@' symbol and returns the first part (the username). Then, we use the apply function to apply this function to each row in the 'Email' column of the DataFrame and create a new 'Username' column. The resulting DataFrame will contain the extracted usernames.
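
The same result can likely be obtained without a helper function by using the vectorized .str accessor. This is a sketch of that variant; the function name add_username_vectorized is hypothetical. Splitting with n=1 stops at the first '@', and missing email values stay missing instead of raising an error.

```python
import pandas as pd

def add_username_vectorized(df):
    # Work on a copy so the caller's DataFrame is left unchanged
    result = df.copy()
    # Split once at the first '@' and keep the part before it
    result['Username'] = result['Email'].str.split('@', n=1).str[0]
    return result

# Example usage:
emails = pd.DataFrame({'Email': ['john.doe@example.com', 'bob@example.com', None]})
with_usernames = add_username_vectorized(emails)
print(with_usernames)
```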

Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.

For example, if df contains the following values:

   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

Your function should select the following rows (row 2 qualifies because 6 > 5 and 9 < 10):

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2

The function should return a new DataFrame that contains only the selected rows.

You can use the Pandas DataFrame's boolean indexing to select rows that meet specific conditions. Here's a Python function to select rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10:

import pandas as pd

def select_rows(df):
    # Use boolean indexing to select rows based on the conditions
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows

# Example usage:
data = {'A': [3, 8, 6, 2, 9],
        'B': [5, 2, 9, 3, 1],
        'C': [1, 7, 4, 5, 2]}
df = pd.DataFrame(data)
selected_df = select_rows(df)
print(selected_df)

In this code, we define the select_rows function, which uses boolean indexing to create a new DataFrame selected_rows that contains only the rows where 'A' is greater than 5 and 'B' is less than 10. The resulting DataFrame selected_df will contain only the selected rows that meet the specified conditions.
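
For readers who prefer an expression string over chained masks, DataFrame.query offers an equivalent filter. This is a brief sketch of that alternative; the function name select_rows_with_query is hypothetical.

```python
import pandas as pd

def select_rows_with_query(df):
    # query evaluates the expression against column names and returns matching rows
    return df.query('A > 5 and B < 10')

# Example usage:
frame = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                      'B': [5, 2, 9, 3, 1],
                      'C': [1, 7, 4, 5, 2]})
selected = select_rows_with_query(frame)
print(selected)
```

With this example data, the rows at index 1, 2, and 4 satisfy both conditions.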

Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean,
median, and standard deviation of the values in the 'Values' column.

You can calculate the mean, median, and standard deviation of the values in the 'Values' column of a Pandas DataFrame using built-in functions from the Pandas library. Here's a Python function that does this:

import pandas as pd

def calculate_statistics(df):
    # Calculate the mean, median, and standard deviation
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_deviation = df['Values'].std()
    
    return mean_value, median_value, std_deviation

# Example usage:
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
mean, median, std_dev = calculate_statistics(df)

print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)

In this code, we define the calculate_statistics function, which calculates the mean, median, and standard deviation of the 'Values' column using the Series methods mean(), median(), and std(). Note that std() computes the sample standard deviation (ddof=1) by default. The function returns the three values as a tuple, which you can then print or use as needed.
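
Which standard deviation is wanted is not stated in the question, so it may help to see both. The sketch below compares the pandas default (sample, ddof=1) with the population version (ddof=0) on the example data; the variable names are illustrative.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

# Default: sample standard deviation, dividing by n - 1
sample_std = values.std()

# Population standard deviation, dividing by n
population_std = values.std(ddof=0)

print("Sample std:", round(sample_std, 2))          # about 15.81
print("Population std:", round(population_std, 2))  # about 14.14
```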

Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to
create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days
for each row in the DataFrame. The moving average should be calculated using a window of size 7 and
should include the current day.

You can calculate the moving average of the 'Sales' column for the past 7 days, including the current day, in a Pandas DataFrame using the rolling function. Here's a Python function to create a new column 'MovingAverage' with the moving averages:

import pandas as pd

def calculate_moving_average(df):
    # Sort the DataFrame by 'Date' if it's not already sorted
    df = df.sort_values(by='Date')
    
    # Calculate the moving average with a window of size 7, including the current day
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()
    
    return df

# Example usage:
data = {'Date': ['2023-09-01', '2023-09-02', '2023-09-03', '2023-09-04', '2023-09-05', '2023-09-06', '2023-09-07'],
        'Sales': [100, 120, 130, 140, 110, 150, 160]}
df = pd.DataFrame(data)

# Convert the 'Date' column to datetime if it's not already
df['Date'] = pd.to_datetime(df['Date'])

df = calculate_moving_average(df)
print(df)

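In this code, we first sort the DataFrame by the 'Date' column because the rolling calculation depends on the order of the rows. Then, we use the rolling method with window=7 to average the current row and the six rows before it. Because min_periods=1, the first six rows are averaged over however many values are available instead of being NaN. Note that window=7 counts rows, not calendar days, so the result is a true 7-day average only when the data has exactly one row per day. The results are stored in a new 'MovingAverage' column in the DataFrame.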
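
If some dates are missing from the data, a time-based window may match the "past 7 days" wording more closely than a fixed row count. The sketch below is one way to do that, assuming the 'Date' column can be parsed by pd.to_datetime; the function name calculate_moving_average_by_days is hypothetical. A '7D' window covers the current row's date and the six calendar days before it.

```python
import pandas as pd

def calculate_moving_average_by_days(df):
    # Work on a sorted copy with a proper datetime column
    result = df.copy()
    result['Date'] = pd.to_datetime(result['Date'])
    result = result.sort_values(by='Date')

    # Roll over a datetime index with a 7-calendar-day window
    by_date = result.set_index('Date')['Sales']
    rolling_mean = by_date.rolling('7D').mean()

    # Row order matches the sorted frame, so assign the values positionally
    result['MovingAverage'] = rolling_mean.to_numpy()
    return result

# Example usage with a gap between 2023-09-02 and 2023-09-10:
sales = pd.DataFrame({'Date': ['2023-09-01', '2023-09-02', '2023-09-10'],
                      'Sales': [100, 200, 300]})
averaged = calculate_moving_average_by_days(sales)
print(averaged)
```

In this example the last row averages only itself, because the two earlier sales fall outside its 7-day window.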

Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g.
Monday, Tuesday) corresponding to each date in the 'Date' column.

For example, if df contains the following values:

   Date
0  2023-01-01
1  2023-01-02
2  2023-01-03
3  2023-01-04
4  2023-01-05

Your function should create the following DataFrame:

   Date        Weekday
0  2023-01-01  Sunday
1  2023-01-02  Monday
2  2023-01-03  Tuesday
3  2023-01-04  Wednesday
4  2023-01-05  Thursday

The function should return the modified DataFrame.

You can create a new 'Weekday' column in a Pandas DataFrame to contain the weekday names corresponding to the dates in the 'Date' column using the dt.strftime function. Here's a Python function to achieve this:

import pandas as pd

def add_weekday_column(df):
    # Convert the 'Date' column to datetime if it's not already
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Extract the weekday names and create the 'Weekday' column
    df['Weekday'] = df['Date'].dt.strftime('%A')
    
    return df

# Example usage:
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']}
df = pd.DataFrame(data)
df = add_weekday_column(df)
print(df)

In this code, we first convert the 'Date' column to a datetime format using pd.to_datetime. Then, we use the .dt.strftime('%A') method to extract the weekday names in the format 'Sunday', 'Monday', etc., and create the 'Weekday' column. The resulting DataFrame will have the desired 'Weekday' column added to it.
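
One caveat is that strftime('%A') follows the process locale, so the names may not be English on every machine. The dt.day_name() accessor returns English names by default and may be a more predictable choice. The sketch below shows that variant; the function name add_weekday_column_day_name is hypothetical, and it works on a copy instead of modifying the input.

```python
import pandas as pd

def add_weekday_column_day_name(df):
    # Work on a copy so the caller's DataFrame is left unchanged
    result = df.copy()
    result['Date'] = pd.to_datetime(result['Date'])
    # day_name() returns names such as 'Sunday', 'Monday', ...
    result['Weekday'] = result['Date'].dt.day_name()
    return result

# Example usage:
dates = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03']})
with_weekdays = add_weekday_column_day_name(dates)
print(with_weekdays)
```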

Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

You can select rows from a Pandas DataFrame based on a date range using boolean indexing on a datetime column, with pd.to_datetime to build the comparison bounds. Here's a Python function to select rows where the date in the 'Date' column falls between '2023-01-01' and '2023-01-31':

import pandas as pd

def select_rows_in_date_range(df):
    # Convert the 'Date' column to datetime if it's not already
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Define the date range
    start_date = pd.to_datetime('2023-01-01')
    end_date = pd.to_datetime('2023-01-31')
    
    # Use boolean indexing to select rows within the date range
    selected_rows = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
    
    return selected_rows

# Example usage:
data = {'Date': ['2023-01-01', '2023-01-15', '2023-01-31', '2023-02-10', '2023-03-05']}
df = pd.DataFrame(data)

df_selected = select_rows_in_date_range(df)
print(df_selected)

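In this code, we first convert the 'Date' column to datetime format using pd.to_datetime. Then, we define the start and end dates of the date range. Finally, we use boolean indexing to select rows that fall within the specified date range, and the resulting DataFrame df_selected contains only those rows. Both comparisons are inclusive, but pd.to_datetime('2023-01-31') is midnight at the start of that day, so rows with a later time on January 31 are excluded; this only matters when the 'Date' column carries a time component.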
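
Since the question says the column contains timestamps, a half-open interval that ends just before February 1 may be the safer reading of "between '2023-01-01' and '2023-01-31'". The sketch below takes that interpretation as an assumption; the function name select_rows_in_january_2023 is hypothetical, and it leaves the input DataFrame unmodified.

```python
import pandas as pd

def select_rows_in_january_2023(df):
    # Parse into a separate Series so the caller's column is not overwritten
    dates = pd.to_datetime(df['Date'])

    start = pd.Timestamp('2023-01-01')
    end_exclusive = pd.Timestamp('2023-02-01')

    # Half-open interval: includes every time of day on January 31
    mask = (dates >= start) & (dates < end_exclusive)
    return df[mask]

# Example usage:
stamps = pd.DataFrame({'Date': ['2023-01-01 00:00:00', '2023-01-31 18:30:00',
                                '2023-02-01 00:00:00', '2022-12-31 23:59:59']})
january = select_rows_in_january_2023(stamps)
print(january)
```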