Q1. List any five functions of the pandas library with execution.

Here are five commonly used functions in the pandas library along with sample code snippets:

read_csv() - Reading Data from CSV:

This function is used to read data from a CSV file and create a DataFrame.

In [None]:
import pandas as pd

# Read data from a CSV file into a DataFrame
df = pd.read_csv('example.csv')

# Display the DataFrame
print(df.head())
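
Because 'example.csv' is not included here, the following minimal sketch passes read_csv() an in-memory buffer instead of a file path. The column names and values are made up for illustration only.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'example.csv'
csv_text = "Category,Value\nA,10\nB,20\nA,30\n"

# read_csv() accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```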


head() - Displaying the First Rows:

The head() function is used to display the first few rows of a DataFrame.

In [None]:
# Display the first 5 rows of the DataFrame
print(df.head())


describe() - Descriptive Statistics:

describe() generates descriptive statistics, including measures of central tendency, dispersion, and shape of the distribution.

In [None]:
# Display summary statistics for numerical columns
print(df.describe())


groupby() - Grouping and Aggregating Data:

The groupby() function is used to group data based on a column and perform aggregate functions.

In [None]:
# Group by 'Category' and calculate the mean for each group
grouped_data = df.groupby('Category')['Value'].mean()
print(grouped_data)
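
As a small, self-contained sketch of the same idea, the example below builds a hypothetical DataFrame with 'Category' and 'Value' columns and uses agg() to compute several aggregates per group in one call. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical data with the 'Category' and 'Value' columns used above
sales = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'],
                      'Value': [10, 20, 30, 40]})

# agg() applies several aggregate functions to each group at once
summary = sales.groupby('Category')['Value'].agg(['mean', 'sum', 'count'])
print(summary)
```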


plot() - Creating Plots:

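The plot() function creates various plots directly from a DataFrame or Series. It uses Matplotlib as its default backend, so matplotlib.pyplot must be imported to display the figure.

In [None]:
import matplotlib.pyplot as plt

# Create a line plot of a numerical column
df['Value'].plot(kind='line', title='Line Plot')
plt.show()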


Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.

You can use the reset_index() function along with a custom index to achieve this. Here's a Python function that re-indexes a DataFrame with a new index starting from 1 and incrementing by 2 for each row:

In [1]:
import pandas as pd

def reindex_dataframe(df):
    # Create a new index starting from 1 and incrementing by 2
    new_index = range(1, 2 * len(df) + 1, 2)

    # Assign the new index to the DataFrame
    df_reindexed = df.reset_index(drop=True)
    df_reindexed.index = new_index

    return df_reindexed

# Example usage:
# Assuming df is your original DataFrame with columns 'A', 'B', 'C'
df = pd.DataFrame({'A': [10, 20, 30],
                   'B': [40, 50, 60],
                   'C': [70, 80, 90]})

# Call the function to re-index the DataFrame
df_reindexed = reindex_dataframe(df)

# Display the re-indexed DataFrame
print(df_reindexed)


    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
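
An alternative sketch, assuming it is preferable to leave the caller's DataFrame untouched, assigns a pd.RangeIndex to a copy. The function name reindex_with_range_index is hypothetical.

```python
import pandas as pd

def reindex_with_range_index(df):
    # Work on a copy so the caller's DataFrame keeps its original index
    out = df.copy()
    # RangeIndex from 1 (inclusive) to 2 * len(df) + 1 (exclusive) in steps of 2
    out.index = pd.RangeIndex(start=1, stop=2 * len(df) + 1, step=2)
    return out

frame = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})
print(reindex_with_range_index(frame))
```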


Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.
For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should
calculate and print the sum of the first three values, which is 60.

You can create a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. Here's an example:

In [2]:
import pandas as pd

def calculate_sum_of_first_three(df):
    # Check if 'Values' column exists in the DataFrame
    if 'Values' in df.columns:
        # Extract the 'Values' column and calculate the sum of the first three values
        values_column = df['Values'].head(3)
        sum_of_first_three = values_column.sum()

        # Print the result to the console
        print("Sum of the first three values:", sum_of_first_three)
    else:
        print("DataFrame does not contain a 'Values' column.")

# Example usage:
# Assuming df is your DataFrame with a 'Values' column
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# Call the function to calculate and print the sum of the first three values
calculate_sum_of_first_three(df)


Sum of the first three values: 60
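
The question asks for a function that iterates, whereas head(3).sum() is vectorized. A minimal sketch with an explicit loop, under the assumption that a literal iteration is wanted, could look like this (the function name is hypothetical):

```python
import pandas as pd

def sum_first_three_by_iteration(df):
    total = 0
    # iloc[:3] limits the loop to at most the first three rows
    for value in df['Values'].iloc[:3]:
        total += value
    print("Sum of the first three values:", total)
    return total

numbers = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
result = sum_first_three_by_iteration(numbers)
```

For large DataFrames the vectorized version is usually preferable; the loop is shown only to match the wording of the question.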


Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.

You can create a Python function that calculates the number of words in each row of the 'Text' column and adds a new column 'Word_Count' to the DataFrame. Here's an example:

In [3]:
import pandas as pd

def calculate_word_count(df):
    # Check if 'Text' column exists in the DataFrame
    if 'Text' in df.columns:
        # Calculate the number of words in each row and create a new 'Word_Count' column
        df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))

        # Optionally, you can display the DataFrame with the new column
        print(df)
    else:
        print("DataFrame does not contain a 'Text' column.")

# Example usage:
# Assuming df is your DataFrame with a 'Text' column
df = pd.DataFrame({'Text': ['This is a sample text.', 'Another example.', 'Just a few words.']})

# Call the function to calculate and add the 'Word_Count' column
calculate_word_count(df)


                     Text  Word_Count
0  This is a sample text.           5
1        Another example.           2
2       Just a few words.           4
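
Note that the apply() version converts every value with str(), so a missing value is counted as one word. A vectorized sketch using the .str accessor, shown here on hypothetical data with one missing entry, leaves missing values as NaN instead:

```python
import pandas as pd

texts = pd.DataFrame({'Text': ['This is a sample text.', None, 'Just a few words.']})

# str.split() splits on whitespace; str.len() counts the resulting pieces
texts['Word_Count'] = texts['Text'].str.split().str.len()
print(texts)
```

Which behavior is appropriate depends on how missing text should be treated in your data.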


Q5. How are DataFrame.size() and DataFrame.shape() different?

Both size and shape are attributes, not methods, so they are accessed without parentheses: DataFrame.size and DataFrame.shape. The difference between the two is as follows:

DataFrame.size:

DataFrame.size returns the total number of elements in the DataFrame.
It is calculated by multiplying the number of rows by the number of columns.
The result includes all elements, regardless of their values (NaN or non-NaN).

In [4]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
size = df.size
print(size)  


6


DataFrame.shape:

DataFrame.shape returns a tuple representing the dimensions of the DataFrame.
The tuple contains two elements: the number of rows and the number of columns.
It provides information about the structure of the DataFrame.

In [5]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
shape = df.shape
print(shape)  


(3, 2)
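
The relationship between the two attributes can be checked directly. The short sketch below uses a hypothetical variable name, df_demo, and also shows len(), which counts rows only.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, cols = df_demo.shape
# size is always rows * columns for a DataFrame
print(df_demo.size == rows * cols)
# len() returns only the number of rows
print(len(df_demo))
```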


Q6. Which function of pandas do we use to read an excel file?

In pandas, the function used to read an Excel file is pd.read_excel(). This function is part of the pandas library and is specifically designed to read data from Excel files and create a DataFrame.

In [None]:
import pandas as pd

# Specify the path to the Excel file
excel_file_path = 'path/to/your/excel/file.xlsx'

# Read the Excel file into a DataFrame
df = pd.read_excel(excel_file_path)

# Display the DataFrame
print(df)


Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.
The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.

In [6]:
import pandas as pd

def extract_username(df):
    # Check if 'Email' column exists in the DataFrame
    if 'Email' in df.columns:
        # Extract the username from each email address and create a new 'Username' column
        df['Username'] = df['Email'].apply(lambda email: email.split('@')[0])

        # Optionally, you can display the DataFrame with the new column
        print(df)
    else:
        print("DataFrame does not contain an 'Email' column.")

# Example usage:
# Assuming df is your DataFrame with an 'Email' column
df = pd.DataFrame({'Email': ['john.doe@example.com', 'alice.smith@example.com', 'bob.jones@example.com']})

# Call the function to extract and add the 'Username' column
extract_username(df)


                     Email     Username
0     john.doe@example.com     john.doe
1  alice.smith@example.com  alice.smith
2    bob.jones@example.com    bob.jones
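
The apply() approach calls email.split() on every value, which raises an AttributeError if an entry is missing. A vectorized sketch with the .str accessor, shown on hypothetical data containing one missing address, keeps missing values as missing:

```python
import pandas as pd

emails = pd.DataFrame({'Email': ['john.doe@example.com', None, 'bob.jones@example.com']})

# Split once at the first '@' and keep the part before it;
# missing values stay missing instead of raising an error
emails['Username'] = emails['Email'].str.split('@', n=1).str[0]
print(emails)
```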


Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.
For example, if df contains the following values:
A B C
0 3 5 1
1 8 2 7
2 6 9 4
3 2 3 5
4 9 1 2

Your function should select the following rows:
A B C
1 8 2 7
4 9 1 2
The function should return a new DataFrame that contains only the selected rows.

In [7]:
import pandas as pd

def select_rows(df):
    # Check if columns 'A' and 'B' exist in the DataFrame
    if 'A' in df.columns and 'B' in df.columns:
        # Select rows where 'A' is greater than 5 and 'B' is less than 10
        selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]

        # Optionally, you can display the selected rows
        print(selected_rows)
        
        return selected_rows
    else:
        print("DataFrame does not contain columns 'A' and 'B'.")
        return pd.DataFrame()

# Example usage:
# Assuming df is your DataFrame with columns 'A', 'B', and 'C'
df = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                   'B': [5, 2, 9, 3, 1],
                   'C': [1, 7, 4, 5, 2]})

# Call the function to select and display rows based on the conditions
selected_df = select_rows(df)


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
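
Row 2 (A = 6, B = 9) appears in the output because it satisfies both conditions as stated, even though the example in the question lists only rows 1 and 4. The same filter can also be written with DataFrame.query(); the sketch below is one alternative, with a hypothetical function name.

```python
import pandas as pd

def select_rows_with_query(df):
    # query() evaluates the expression against the column names
    return df.query('A > 5 and B < 10')

data = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                     'B': [5, 2, 9, 3, 1],
                     'C': [1, 7, 4, 5, 2]})
print(select_rows_with_query(data))
```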


Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean,
median, and standard deviation of the values in the 'Values' column.

In [8]:
import pandas as pd

def calculate_statistics(df):
    # Check if 'Values' column exists in the DataFrame
    if 'Values' in df.columns:
        # Calculate mean, median, and standard deviation
        mean_value = df['Values'].mean()
        median_value = df['Values'].median()
        std_deviation = df['Values'].std()

        # Display the results
        print("Mean:", mean_value)
        print("Median:", median_value)
        print("Standard Deviation:", std_deviation)

        # Optionally, you can return the calculated statistics
        return mean_value, median_value, std_deviation
    else:
        print("DataFrame does not contain a 'Values' column.")
        return None

# Example usage:
# Assuming df is your DataFrame with a 'Values' column
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# Call the function to calculate and display statistics
calculate_statistics(df)


Mean: 30.0
Median: 30.0
Standard Deviation: 15.811388300841896


(30.0, 30.0, 15.811388300841896)
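
The standard deviation printed above is the sample standard deviation, because Series.std() uses ddof=1 by default. If the population standard deviation is wanted instead, ddof=0 can be passed, as in this short sketch:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

sample_std = values.std()            # ddof=1 by default (divides by n - 1)
population_std = values.std(ddof=0)  # divides by n
print(sample_std, population_std)
```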

Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to
create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days
for each row in the DataFrame. The moving average should be calculated using a window of size 7 and
should include the current day.

In [9]:
import pandas as pd

def add_moving_average(df):
    # Sort DataFrame by 'Date' column if not already sorted
    df = df.sort_values(by='Date')

    # Calculate moving average using a window of size 7
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()

    return df

# Example usage:
# Assuming df is your DataFrame with 'Sales' and 'Date' columns
# df = pd.DataFrame({'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
#                    'Sales': [10, 20, 30, 40, 50]})
# df['Date'] = pd.to_datetime(df['Date'])
# df = add_moving_average(df)
# print(df)
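
rolling(window=7) counts rows, so it matches "the past 7 days" only when there is exactly one row per day. If dates can be missing, a time-based window is one option. The sketch below assumes 'Date' can be parsed as datetimes; the function name and sample data are hypothetical.

```python
import pandas as pd

def add_time_based_moving_average(df):
    out = df.copy()
    out['Date'] = pd.to_datetime(out['Date'])
    out = out.sort_values('Date')
    # '7D' covers the current day and the six calendar days before it
    rolled = out.set_index('Date')['Sales'].rolling('7D').mean()
    out['MovingAverage'] = rolled.to_numpy()
    return out

# Hypothetical data with a gap between 2024-01-03 and 2024-01-10
gappy = pd.DataFrame({'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-10'],
                      'Sales': [10, 20, 30, 40]})
averaged = add_time_based_moving_average(gappy)
print(averaged)
```

With this data, the last row averages only its own value, because no other sale falls within the seven days ending on 2024-01-10.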


Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g.
Monday, Tuesday) corresponding to each date in the 'Date' column.
For example, if df contains the following values:
Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05
Your function should create the following DataFrame:

Date Weekday
0 2023-01-01 Sunday
1 2023-01-02 Monday
2 2023-01-03 Tuesday
3 2023-01-04 Wednesday
4 2023-01-05 Thursday
The function should return the modified DataFrame.

In [10]:
import pandas as pd

def add_weekday_column(df):
    # Convert 'Date' column to datetime if it's not already
    df['Date'] = pd.to_datetime(df['Date'])

    # Add a new 'Weekday' column with the weekday names
    df['Weekday'] = df['Date'].dt.day_name()

    return df

# Example usage:
# Assuming df is your DataFrame with 'Date' column
# df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']})
# df = add_weekday_column(df)
# print(df)


Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [11]:
import pandas as pd

def select_rows_between_dates(df):
    # Convert 'Date' to datetime in a separate Series so the caller's DataFrame is not modified
    dates = pd.to_datetime(df['Date'])

    # Define the start and end dates
    start_date = pd.Timestamp('2023-01-01')
    end_date = pd.Timestamp('2023-01-31')

    # Use boolean indexing; the upper bound is the day after end_date (exclusive)
    # so that timestamps later in the day on 2023-01-31 are still selected
    selected_rows = df[(dates >= start_date) & (dates < end_date + pd.Timedelta(days=1))]

    return selected_rows

# Example usage:
# Assuming df is your DataFrame with a 'Date' column
# selected_rows = select_rows_between_dates(df)
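
When the 'Date' column carries times of day, a comparison of the form `<= '2023-01-31'` stops at midnight at the start of that date. One way to keep the whole last day, sketched here with a hypothetical function name and sample data, is to normalize the timestamps before calling Series.between():

```python
import pandas as pd

def select_january_2023(df):
    dates = pd.to_datetime(df['Date'])
    # normalize() sets the time to midnight so every time on 2023-01-31 counts
    mask = dates.dt.normalize().between(pd.Timestamp('2023-01-01'),
                                        pd.Timestamp('2023-01-31'))
    return df[mask]

stamps = pd.DataFrame({'Date': ['2022-12-31 23:59', '2023-01-01 00:00',
                                '2023-01-31 18:30', '2023-02-01 00:00']})
january = select_january_2023(stamps)
print(january)
```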


Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to
be imported?