Q 1 List any five functions of the pandas library with execution.

Ans - Here are five commonly used functions from the pandas library, each with a short example:

(1) read_csv(): This function is used to read data from a CSV file and create a DataFrame.

In [None]:
import pandas as pd

# Reading data from a CSV file and creating a DataFrame
df = pd.read_csv('data.csv')

# Print the first few rows of the DataFrame
print(df.head())
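
Since 'data.csv' is not included here, the following minimal sketch reads the same kind of data from an in-memory string instead. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = "Name,Age,Score\nAsha,21,88\nBen,22,75\nCara,21,92\n"

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

# Inspect the first rows and the inferred column types
print(sample_df.head())
print(sample_df.dtypes)
```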


(2) groupby(): This function is used to group data based on one or more columns and perform aggregate operations on the groups.

In [None]:
import pandas as pd

# Assuming df is a DataFrame containing 'Name', 'Age', and 'Score' columns

# Group data by 'Age' and calculate the average 'Score' for each age group
grouped_df = df.groupby('Age')['Score'].mean()

# Print the result
print(grouped_df)
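
The snippet above assumes that df already exists. A self-contained sketch, using made-up data with the same column names, might look like this:

```python
import pandas as pd

# Made-up data with the columns the snippet above assumes
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Cara", "Dev"],
    "Age": [21, 22, 21, 22],
    "Score": [80, 70, 90, 60],
})

# Mean score per age; the result is a Series indexed by 'Age'
mean_by_age = df.groupby("Age")["Score"].mean()
print(mean_by_age)

# Several aggregates at once via agg()
summary = df.groupby("Age")["Score"].agg(["mean", "max", "count"])
print(summary)
```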


(3) merge(): This function is used to merge two DataFrames based on a common key or index.

In [None]:
import pandas as pd

# Assuming df1 and df2 are two DataFrames with a common column 'ID'

# Merge the DataFrames on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID')

# Print the merged DataFrame
print(merged_df)
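
The snippet above assumes df1 and df2 already exist. As a minimal sketch with made-up data, the following shows the default inner join next to a left join selected with the how parameter:

```python
import pandas as pd

# Two made-up frames that share the 'ID' column
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Asha", "Ben", "Cara"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Score": [70, 90, 85]})

# Default how='inner' keeps only IDs present in both frames
inner = pd.merge(df1, df2, on="ID")
print(inner)

# how='left' keeps every row of df1; IDs missing from df2 get NaN scores
left = pd.merge(df1, df2, on="ID", how="left")
print(left)
```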


(4) pivot_table(): This function is used to create a pivot table from a DataFrame, summarizing one column's values across the categories of two others.

In [None]:
import pandas as pd

# Assuming df is a DataFrame containing 'Name', 'Age', 'Score', and 'Grade' columns

# Create a pivot table with 'Age' as rows, 'Grade' as columns, and 'Score' as values
pivot_table = df.pivot_table(index='Age', columns='Grade', values='Score', aggfunc='mean')

# Print the pivot table
print(pivot_table)
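
The snippet above assumes that df already exists. A self-contained sketch with made-up data shows the shape of the result, including the NaN that appears where an Age/Grade combination has no rows:

```python
import pandas as pd

# Made-up data with the columns the snippet above assumes
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Cara", "Dev"],
    "Age": [21, 21, 22, 22],
    "Grade": ["A", "B", "A", "A"],
    "Score": [90, 70, 80, 60],
})

# Rows are ages, columns are grades, cells hold the mean score
table = df.pivot_table(index="Age", columns="Grade", values="Score", aggfunc="mean")
print(table)
```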


(5) drop_duplicates(): This function is used to remove duplicate rows from a DataFrame.

In [None]:
import pandas as pd

# Assuming df is a DataFrame with possible duplicate rows

# Drop duplicate rows based on all columns
deduplicated_df = df.drop_duplicates()

# Print the DataFrame after removing duplicates
print(deduplicated_df)
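
The snippet above assumes that df already exists. The sketch below uses made-up data and also shows the subset and keep parameters, which control which columns define a duplicate and which occurrence survives:

```python
import pandas as pd

# Made-up data: row 2 repeats row 0 exactly; row 3 repeats only the name
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Asha", "Ben"],
    "Score": [90, 70, 90, 75],
})

# Full-row duplicates: only the second ("Asha", 90) row is dropped
all_cols = df.drop_duplicates()
print(all_cols)

# Duplicates judged on 'Name' only, keeping the last occurrence of each name
by_name = df.drop_duplicates(subset="Name", keep="last")
print(by_name)
```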


Q 2 Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

Ans - To re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row, you can use the reset_index() function in pandas along with some additional steps. Here's a Python function that achieves this:

In [3]:
import pandas as pd

def reindex_dataframe(df):
    # Reset the existing index to default integer index
    df_reset = df.reset_index(drop=True)

    # Create a new index starting from 1 and incrementing by 2 for each row
    new_index = pd.Series(range(1, len(df_reset)*2, 2))

    # Assign the new index to the DataFrame
    df_reset['new_index'] = new_index

    # Set the new index as the index of the DataFrame
    df_reset.set_index('new_index', inplace=True)

    return df_reset

# Example DataFrame with columns 'A', 'B', and 'C'
data = {
    'A': [10, 20, 30, 40],
    'B': [100, 200, 300, 400],
    'C': [1000, 2000, 3000, 4000]
}
df = pd.DataFrame(data)

# Call the reindex_dataframe function and print the result
df_reindexed = reindex_dataframe(df)
print(df_reindexed)


            A    B     C
new_index               
1          10  100  1000
3          20  200  2000
5          30  300  3000
7          40  400  4000


In this function, we first reset the existing index of the DataFrame using reset_index(drop=True), which creates a default integer index starting from 0. Then, we create a new index using pd.Series(range(1, len(df_reset)*2, 2)), which generates a Series of numbers starting from 1 and incrementing by 2 for each row.

Next, we add the new index as a new column to the DataFrame (df_reset['new_index'] = new_index) and set it as the index of the DataFrame using df_reset.set_index('new_index', inplace=True).

The final result is the DataFrame with the new index starting from 1 and incremented by 2 for each row.
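
The same result can probably be reached more directly by assigning a range to the index. This is a sketch of an alternative, not a replacement for the function above; the name reindex_odd is made up for illustration.

```python
import pandas as pd

def reindex_odd(df):
    # Work on a copy so the caller's DataFrame is left untouched
    out = df.copy()
    # start=1, stop=2*n, step=2 yields 1, 3, 5, ... with exactly n entries
    out.index = pd.RangeIndex(start=1, stop=2 * len(out), step=2)
    return out

example = pd.DataFrame({"A": [10, 20, 30, 40],
                        "B": [100, 200, 300, 400],
                        "C": [1000, 2000, 3000, 4000]})
print(reindex_odd(example))
```

Unlike the version above, this index has no name, so no 'new_index' header row appears in the printed output.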

Q 3 You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

Ans - You can achieve this without explicitly iterating over the DataFrame using the pandas built-in functions. The pandas library provides vectorized operations, which are generally more efficient and faster than explicit iteration. Here's how you can calculate the sum of the first three values in the 'Values' column without explicit iteration:

In [4]:
import pandas as pd

def calculate_sum_of_first_three(df):
    # Calculate the sum of the first three values in the 'Values' column
    sum_first_three = df['Values'].head(3).sum()

    # Print the sum to the console
    print("Sum of the first three values:", sum_first_three)

# Example DataFrame with a 'Values' column
data = {
    'Values': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Call the calculate_sum_of_first_three function
calculate_sum_of_first_three(df)


Sum of the first three values: 60


In this function, we use the pandas head(3) method to select the first three rows from the 'Values' column and then use the .sum() method to calculate their sum. This approach is more concise, efficient, and idiomatic in pandas, avoiding the need for explicit iteration.
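
Because the question asks explicitly for iteration, here is a sketch of a loop-based version for comparison. The function name is made up, and returning the total in addition to printing it is an extra convenience beyond what the question requires.

```python
import pandas as pd

def sum_first_three_by_iteration(df):
    total = 0
    # Walk the column's values in order and stop after the third one
    for position, value in enumerate(df["Values"]):
        if position >= 3:
            break
        total += value
    print("Sum of the first three values:", total)
    return total

example = pd.DataFrame({"Values": [10, 20, 30, 40, 50]})
sum_first_three_by_iteration(example)
```

If the column holds fewer than three values, both versions simply sum whatever is there.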






Q 4 Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

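Ans - You can apply a word-counting helper to each value in the 'Text' column. Here's a Python function that does this:

In [None]: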
import pandas as pd
import re

def count_words(text):
    words = re.findall(r'\w+', text)
    return len(words)

def add_word_count_column(df):
    df['Word_Count'] = df['Text'].apply(count_words)
    return df

# Example DataFrame
data = {'Text': ["This is a sample text.", "Hello, how are you?", "Python programming is fun!"]}
df = pd.DataFrame(data)

# Add word count column
df = add_word_count_column(df)

print(df)


In this example, the count_words function uses the regular expression \w+ to find every run of word characters in the input text and returns the number of matches. The add_word_count_column function applies count_words to each row in the 'Text' column of the DataFrame and stores the results in a new column 'Word_Count'. Note that it adds the column to the DataFrame it is given rather than to a copy.
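
A vectorized alternative is to split on whitespace with the .str accessor. The sketch below assumes that treating missing text as zero words is acceptable. It can also count differently from the \w+ pattern: "don't" is one whitespace-separated token but two \w+ matches.

```python
import pandas as pd

def add_word_count_vectorized(df):
    # Copy so the caller's DataFrame is not modified
    out = df.copy()
    # split() with no arguments splits on any run of whitespace
    counts = out["Text"].str.split().str.len()
    # Missing text produces NaN; treat it as zero words (an assumption)
    out["Word_Count"] = counts.fillna(0).astype(int)
    return out

example = pd.DataFrame({"Text": ["This is a sample text.", "don't stop", None]})
print(add_word_count_vectorized(example))
```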

Q 5 How are DataFrame.size() and DataFrame.shape() different?

Ans - Both DataFrame.size and DataFrame.shape are attributes of a Pandas DataFrame, not methods, so they are accessed without parentheses (df.size, df.shape). They provide different information about the structure of the DataFrame.

(1) DataFrame.size:
* DataFrame.size returns the total number of elements in the DataFrame, which is calculated as the product of the number of rows and the number of columns.
* It gives you the total count of cells in the DataFrame, including empty cells or cells containing missing values (NaN).

Example:

In [None]:
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

total_elements = df.size
print(total_elements)  # Output: 6 (3 rows * 2 columns)


(2) DataFrame.shape:
* DataFrame.shape returns a tuple representing the dimensions of the DataFrame. The tuple contains two elements: the number of rows and the number of columns.
* It provides the actual structure of the DataFrame in terms of rows and columns.

Example:

In [None]:
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

num_rows, num_columns = df.shape
print(num_rows, num_columns)  # Output: 3 2


In summary, DataFrame.size gives you the total count of elements (cells) in the DataFrame, while DataFrame.shape gives you the dimensions of the DataFrame in terms of rows and columns.
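
To make the point about missing values concrete, here is a small sketch with made-up data containing NaN. It compares size and shape with count(), which skips missing cells.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the six cells are missing
df = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, np.nan]})

print(df.shape)          # (rows, columns) tuple
print(df.size)           # rows * columns, NaN cells included
print(df.count().sum())  # number of non-missing cells only
print(len(df))           # number of rows, same as df.shape[0]
```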

Q 6 Which function of pandas do we use to read an excel file?

Ans - In Pandas, you can use the pd.read_excel() function to read data from an Excel file into a DataFrame. This function allows you to read data from Excel files with various formats, including .xls and .xlsx.

Here's how you can use the pd.read_excel() function:

In [None]:
import pandas as pd

# Read an Excel file into a DataFrame
df = pd.read_excel('file_path.xlsx')

# Display the DataFrame
print(df)


In the above code, replace 'file_path.xlsx' with the actual path to your Excel file. The function will read the data from the specified Excel file and create a DataFrame containing the data.

Additionally, the pd.read_excel() function provides optional parameters that let you customize how the data is read, such as choosing the sheet (sheet_name), skipping rows (skiprows), selecting columns (usecols), and controlling the header row (header). Refer to the official Pandas documentation for pd.read_excel() for details on the available options.






Q 7 You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.

The username is the part of the email address that appears before the '@' symbol. For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.

Ans - You can achieve this by applying a small helper function to each value in the 'Email' column with Series.apply(). Here's a Python function that does this:

In [None]:
import pandas as pd

def extract_username(email):
    return email.split('@')[0]

def add_username_column(df):
    df['Username'] = df['Email'].apply(extract_username)
    return df

# Example DataFrame
data = {'Email': ['john.doe@example.com', 'jane.smith@example.com', 'alice.wonderland@example.com']}
df = pd.DataFrame(data)

# Add 'Username' column
df = add_username_column(df)

print(df)


In this example, the extract_username function splits each email address at the '@' symbol and retrieves the part before it (i.e., the username). The add_username_column function applies this extraction to each row in the 'Email' column and creates a new 'Username' column containing the extracted usernames.
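
The same extraction can also be written with the .str accessor, which avoids a Python-level helper and passes missing values through as NaN instead of raising an error. This is a sketch; the function name is made up for illustration.

```python
import pandas as pd

def add_username_vectorized(df):
    # Copy so the caller's DataFrame is not modified
    out = df.copy()
    # Split each address at '@' and keep the first piece
    out["Username"] = out["Email"].str.split("@").str[0]
    return out

example = pd.DataFrame({"Email": ["john.doe@example.com", "jane.smith@example.com", None]})
print(add_username_vectorized(example))
```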

Q 8 You have a Pandas DataFrame df with columns 'A','B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.

Ans - Certainly! You can use boolean indexing to filter the DataFrame based on the specified conditions. Here's a Python function that does what you've described:

In [None]:
import pandas as pd

def filter_rows(df):
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows

# Example DataFrame
data = {'A': [3, 8, 6, 4, 9],
        'B': [12, 5, 8, 3, 7],
        'C': [0.5, 0.8, 0.6, 0.4, 0.9]}

df = pd.DataFrame(data)

# Select rows based on conditions
selected_df = filter_rows(df)

print(selected_df)


In this example, the filter_rows function applies boolean indexing to the DataFrame df. It selects the rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10, and then returns a new DataFrame containing only these selected rows.
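
If the thresholds need to vary, they can be passed in as parameters. The sketch below uses hypothetical parameter names a_min and b_max, and also shows an equivalent version built on DataFrame.query().

```python
import pandas as pd

def filter_rows_with_thresholds(df, a_min=5, b_max=10):
    # Each comparison yields a boolean Series; & combines them element-wise
    mask = (df["A"] > a_min) & (df["B"] < b_max)
    return df.loc[mask].copy()

def filter_rows_with_query(df, a_min=5, b_max=10):
    # '@name' refers to a local Python variable inside the query string
    return df.query("A > @a_min and B < @b_max")

example = pd.DataFrame({"A": [3, 8, 6, 4, 9],
                        "B": [12, 5, 8, 3, 7],
                        "C": [0.5, 0.8, 0.6, 0.4, 0.9]})
print(filter_rows_with_thresholds(example))
```

With the example data, the rows at index 1, 2, and 4 satisfy both conditions.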

Q 9 Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

Ans - Certainly! You can use the Pandas library to easily calculate the mean, median, and standard deviation of the values in the 'Values' column of your DataFrame. Here's a Python function that does this:

In [None]:
import pandas as pd

def calculate_statistics(df):
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_deviation = df['Values'].std()
    return mean_value, median_value, std_deviation

# Example DataFrame
data = {'Values': [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)

# Calculate statistics
mean, median, std = calculate_statistics(df)

print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std)


Replace the data dictionary with your actual data if needed. The calculate_statistics function calculates the mean, median, and standard deviation of the 'Values' column using Pandas' built-in functions. The example DataFrame contains values [10, 15, 20, 25, 30], so the printed output will show the calculated statistics for these values.
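
One detail worth knowing is that Series.std() computes the sample standard deviation (ddof=1) by default. The sketch below works through the example values and compares it with the population version (ddof=0).

```python
import pandas as pd

values = pd.Series([10, 15, 20, 25, 30])

# Squared deviations from the mean (20) are 100, 25, 0, 25, 100, summing to 250
sample_std = values.std()            # sqrt(250 / 4), about 7.906
population_std = values.std(ddof=0)  # sqrt(250 / 5), about 7.071

print("Mean:", values.mean())
print("Median:", values.median())
print("Sample std:", sample_std)
print("Population std:", population_std)
```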

Q 10 Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

Ans - Certainly! You can use the rolling() function provided by Pandas to calculate the moving average for the past 7 days. Here's a Python function that adds a new 'MovingAverage' column to the DataFrame containing the moving average of 'Sales' over a 7-day window:

In [None]:
import pandas as pd

def calculate_moving_average(df, window_size=7):
    df['MovingAverage'] = df['Sales'].rolling(window=window_size, min_periods=1).mean()
    return df

# Example DataFrame
data = {'Date': pd.date_range(start='2023-07-01', periods=15, freq='D'),
        'Sales': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]}
df = pd.DataFrame(data)

# Calculate and add moving average
df = calculate_moving_average(df)

print(df)


In this example, the calculate_moving_average function adds a new 'MovingAverage' column to the DataFrame. The rolling() function is used with a window size of 7 and min_periods=1 to calculate the moving average for the 'Sales' column. Because window=7 counts rows rather than calendar days, this matches "the past 7 days" only when the DataFrame is sorted by 'Date' and has exactly one row per day, as the example data does. The min_periods=1 parameter ensures that the first six rows, which have fewer than 7 values available, still get a result (the mean of the values seen so far). Each row's moving average includes the current day.
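
When dates may be missing or unsorted, a time-based window is one way to keep the average tied to calendar days. The following is a sketch under that assumption; the function name and the gappy example data are made up for illustration.

```python
import pandas as pd

def add_time_based_moving_average(df, window="7D"):
    # Sort by date on a copy; time-based windows need a monotonic datetime index
    out = df.sort_values("Date").copy()
    # '7D' covers the interval (t - 7 days, t]: the current day and the six before it
    rolled = out.set_index("Date")["Sales"].rolling(window).mean()
    # Copy the values back positionally, since 'out' keeps its original index
    out["MovingAverage"] = rolled.to_numpy()
    return out

# Made-up data with a gap between 2 July and 10 July
gappy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-10"]),
    "Sales": [10, 20, 30],
})
print(add_time_based_moving_average(gappy))
```

Here the last row averages only its own value, because no other sale falls within the seven days ending on 10 July.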






Q 11 You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.

For example, if df contains the following values:

         Date
0  2023-01-01
1  2023-01-02
2  2023-01-03
3  2023-01-04
4  2023-01-05

Your function should create the following DataFrame:

         Date    Weekday
0  2023-01-01     Sunday
1  2023-01-02     Monday
2  2023-01-03    Tuesday
3  2023-01-04  Wednesday
4  2023-01-05   Thursday

The function should return the modified DataFrame.

Ans - You can achieve this using the .dt accessor, which exposes datetime operations on a column, to extract the weekday name from the 'Date' column. Here's a Python function that adds a new 'Weekday' column to the DataFrame with the corresponding weekday names:

In [None]:
import pandas as pd

def add_weekday_column(df):
    df['Weekday'] = df['Date'].dt.strftime('%A')
    return df

# Example DataFrame
data = {'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])}
df = pd.DataFrame(data)

# Add 'Weekday' column
df = add_weekday_column(df)

print(df)


In this example, the add_weekday_column function adds a new 'Weekday' column to the DataFrame. The .dt.strftime('%A') method is used to extract the weekday name from the 'Date' column and format it as the full weekday name (e.g., 'Sunday', 'Monday', etc.). The resulting DataFrame will have the 'Weekday' column containing the corresponding weekday names for each date in the 'Date' column.
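
The .dt accessor only works on datetime columns, so a 'Date' column stored as strings would raise an error. The sketch below converts first with pd.to_datetime() and uses dt.day_name() as an alternative to strftime('%A'); the function name is made up for illustration.

```python
import pandas as pd

def add_weekday_column_from_any(df):
    # Copy so the caller's DataFrame is not modified
    out = df.copy()
    # Parse strings (or pass through existing datetimes) before using .dt
    dates = pd.to_datetime(out["Date"])
    out["Weekday"] = dates.dt.day_name()
    return out

# Dates given as plain strings rather than datetimes
example = pd.DataFrame({"Date": ["2023-01-01", "2023-01-02", "2023-01-03"]})
print(add_weekday_column_from_any(example))
```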






Q 12 Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

Ans - You can use boolean indexing in Pandas to filter rows based on a date range. Here's a Python function that selects rows from the DataFrame where the date is between '2023-01-01' and '2023-01-31':


In [None]:
import pandas as pd

def filter_date_range(df):
    start_date = '2023-01-01'
    end_date = '2023-01-31'
    mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
    selected_rows = df[mask]
    return selected_rows

# Example DataFrame
data = {'Date': pd.to_datetime(['2023-01-01', '2023-01-15', '2023-02-05', '2023-01-20', '2023-01-31'])}
df = pd.DataFrame(data)

# Select rows within the date range
selected_df = filter_date_range(df)

print(selected_df)


In this example, the filter_date_range function uses boolean indexing to create a mask that checks whether each date lies between '2023-01-01' and '2023-01-31', inclusive at both ends. The mask is then used to select the rows that satisfy this condition, and the function returns a new DataFrame containing only those rows, which the example stores in selected_df.
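
One caveat: the string '2023-01-31' is compared as midnight at the start of that day, so a timestamp such as 2023-01-31 15:30 falls outside the range. If the whole last day should be included, a half-open interval is one option. The sketch below assumes that intent; the function and parameter names are made up.

```python
import pandas as pd

def filter_dates_whole_days(df, start="2023-01-01", end="2023-01-31"):
    start_ts = pd.Timestamp(start)
    # First instant after the end date, so every time on 'end' is included
    end_exclusive = pd.Timestamp(end) + pd.Timedelta(days=1)
    mask = (df["Date"] >= start_ts) & (df["Date"] < end_exclusive)
    return df[mask]

stamps = pd.DataFrame({"Date": pd.to_datetime([
    "2022-12-31 23:59", "2023-01-01 00:00", "2023-01-31 15:30", "2023-02-01 00:00",
])})
print(filter_dates_whole_days(stamps))
```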

Q 13 To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

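Ans - The pandas library itself must be imported before any of its functions can be used. By convention it is imported under the alias pd, as in every example above:

In [None]:
import pandas as pd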