# Assignment

## Q1. List any five functions of the pandas library with execution.

Ans: read_csv(): This function is used to read a CSV (Comma-Separated Values) file and create a DataFrame from it.

head(): This function returns the first n rows of a DataFrame (5 by default), which is useful for a quick look at the data.

groupby(): This function is used to group rows of a DataFrame based on one or more columns and perform aggregate operations on the grouped data.

merge(): This function is used to combine two DataFrames with a database-style join on common columns or indices.

fillna(): This function is used to fill missing values in a DataFrame with a specified value or method.
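
Since the question asks for execution, here is a minimal sketch that runs all five functions on a small made-up dataset. The CSV text, the column names (`city`, `year`, `sales`) and the `regions` lookup table are assumptions for illustration only. The CSV is read from an in-memory string so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; one 'sales' value is deliberately missing
csv_text = (
    "city,year,sales\n"
    "Pune,2022,100\n"
    "Pune,2023,\n"
    "Delhi,2022,80\n"
    "Delhi,2023,120\n"
)

# read_csv(): build a DataFrame from CSV data
sales = pd.read_csv(io.StringIO(csv_text))

# head(): look at the first two rows
first_rows = sales.head(2)

# fillna(): replace the missing 'sales' value with 0
filled = sales.fillna({"sales": 0})

# groupby(): total sales per city
totals = filled.groupby("city")["sales"].sum()

# merge(): attach a region to each row via the shared 'city' column
regions = pd.DataFrame({"city": ["Pune", "Delhi"], "region": ["West", "North"]})
merged = filled.merge(regions, on="city")

print(first_rows)
print(totals)
print(merged)
```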

## Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [1]:
import pandas as pd

def reindex_dataframe(df):
    df = df.reset_index(drop=True)  # Reset the existing index

    # Create a new index using lambda function
    df.index = df.index.map(lambda x: x * 2 + 1)

    return df

data = {'A': [10, 20, 30, 40],
        'B': [50, 60, 70, 80],
        'C': [90, 100, 110, 120]}

df = pd.DataFrame(data)

# Re-index the DataFrame using the custom function
df_reindexed = reindex_dataframe(df)

print(df_reindexed)


    A   B    C
1  10  50   90
3  20  60  100
5  30  70  110
7  40  80  120
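
As an alternative sketch, the same index can be built directly from a `pd.RangeIndex` instead of mapping a lambda over the reset index. The function name `reindex_with_range` and the sample data are illustrative assumptions. This version copies the input so the caller's DataFrame is left unchanged.

```python
import pandas as pd

def reindex_with_range(df):
    # Work on a copy so the original DataFrame keeps its index
    out = df.copy()
    # Labels 1, 3, 5, ... with one label per row
    out.index = pd.RangeIndex(start=1, stop=2 * len(out) + 1, step=2)
    return out

# Sample data with a non-default index to show that old labels are ignored
sample = pd.DataFrame({'A': [10, 20, 30],
                       'B': [40, 50, 60],
                       'C': [70, 80, 90]},
                      index=['x', 'y', 'z'])

reindexed = reindex_with_range(sample)
print(reindexed)
```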


## Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

In [1]:
import pandas as pd

def calculate_sum_of_first_three(df):
    sum_of_first_three = 0

    # Iterate over the first three rows by position, so the result
    # does not depend on the index labels being 0, 1, 2
    for _, row in df.head(3).iterrows():
        sum_of_first_three += row['Values']

    print("Sum of the first three values:", sum_of_first_three)

    
# sample DataFrame
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Calculate the sum of the first three values using the custom function
calculate_sum_of_first_three(df)
    

Sum of the first three values: 60
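
The question asks for explicit iteration, but for comparison here is a sketch of the vectorized form, which selects the first three values by position and sums them in one step. The variable names are illustrative.

```python
import pandas as pd

values_df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# iloc[:3] selects the first three rows by position, whatever the index labels are
sum_of_first_three = values_df['Values'].iloc[:3].sum()

print("Sum of the first three values:", sum_of_first_three)
```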


## Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

In [2]:
import pandas as pd

def count_words(df):
    # Create a new column 'Word_Count' by applying a lambda function
    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))

    return df

# sample DataFrame
data = {'Text': ['Hello, how are you?', 'I am doing well.', 'Python is great!']}
df = pd.DataFrame(data)

df_with_word_count = count_words(df)

# Print the DataFrame with the new column
print(df_with_word_count)


                  Text  Word_Count
0  Hello, how are you?           4
1     I am doing well.           4
2     Python is great!           3
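
One caveat with `str(x).split()` is that a missing value is converted to the string "nan" and counted as one word. A possible alternative, sketched below, uses the `.str` accessor, which leaves missing entries as missing. The function name `count_words_vectorized` and the sample data are assumptions for illustration.

```python
import pandas as pd

def count_words_vectorized(df):
    # Copy so the caller's DataFrame is not modified
    out = df.copy()
    # Split on whitespace, then take the length of each resulting list;
    # rows where 'Text' is missing stay missing instead of counting as 1
    out['Word_Count'] = out['Text'].str.split().str.len()
    return out

sample = pd.DataFrame({'Text': ['Hello, how are you?', None, 'Python is great!']})
counted = count_words_vectorized(sample)
print(counted)
```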


## Q5. How are DataFrame.size() and DataFrame.shape() different?

DataFrame.size and DataFrame.shape are both attributes of a pandas DataFrame, not methods, so they are accessed without parentheses. They report different information about the DataFrame.

DataFrame.size: This attribute returns the total number of elements (cells) in the DataFrame as an integer, which equals the number of rows multiplied by the number of columns.

DataFrame.shape: This attribute returns the dimensions of the DataFrame as a tuple (rows, columns).

In [3]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]}

df = pd.DataFrame(data)

print(df.size)  # Output: 9 (3 rows * 3 columns)

print(df.shape)  # Output: (3, 3) (3 rows, 3 columns)


9
(3, 3)


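## Q6. Which function of pandas do we use to read an Excel file?
Ans: To read an Excel file in pandas, we use the read_excel() function, which returns a DataFrame. Reading .xlsx files requires an Excel engine such as openpyxl to be installed.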

## Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.

In [1]:
import pandas as pd

def extract_username(df):

    df['Username'] = df['Email'].str.split('@').str[0]
    
    return df

df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.smith@example.com', 'bob.miller@example.com']})

df = extract_username(df)

print(df)


                    Email    Username
0    john.doe@example.com    john.doe
1  jane.smith@example.com  jane.smith
2  bob.miller@example.com  bob.miller


## Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.

In [2]:
import pandas as pd

def select_rows(df):
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows

df = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                   'B': [5, 2, 9, 3, 1],
                   'C': [1, 7, 4, 5, 2]})

selected_df = select_rows(df)

print(selected_df)


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2


## Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

In [3]:
import pandas as pd

def calculate_statistics(df):
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_value = df['Values'].std()
    
    return mean_value, median_value, std_value

df = pd.DataFrame({'Values': [5, 10, 15, 20, 25]})

mean, median, std = calculate_statistics(df)

print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std)

Mean: 15.0
Median: 15.0
Standard Deviation: 7.905694150420948
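
Note that `Series.std()` computes the sample standard deviation by default (`ddof=1`), which is why the result above is about 7.91. A short sketch of the difference, reusing the same five values:

```python
import pandas as pd

values = pd.Series([5, 10, 15, 20, 25])

# Default: sample standard deviation, divides by n - 1
sample_std = values.std()

# Population standard deviation, divides by n
population_std = values.std(ddof=0)

print("Sample std:", sample_std)
print("Population std:", population_std)
```

Which one is appropriate depends on whether the values are a sample from a larger population or the whole population.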


## Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

In [4]:
import pandas as pd

def calculate_moving_average(df):
    window_size = 7
    df['MovingAverage'] = df['Sales'].rolling(window=window_size, min_periods=1).mean()
    return df

df = pd.DataFrame({'Date': pd.date_range(start='2023-01-01', periods=10),
                   'Sales': [10, 12, 8, 15, 20, 18, 13, 9, 11, 16]})

df = calculate_moving_average(df)

print(df)


        Date  Sales  MovingAverage
0 2023-01-01     10      10.000000
1 2023-01-02     12      11.000000
2 2023-01-03      8      10.000000
3 2023-01-04     15      11.250000
4 2023-01-05     20      13.000000
5 2023-01-06     18      13.833333
6 2023-01-07     13      13.714286
7 2023-01-08      9      13.571429
8 2023-01-09     11      13.428571
9 2023-01-10     16      14.571429
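
The function above uses a window of 7 rows, which matches 7 days only when there is exactly one row per day with no gaps. If dates can be missing, a time-based window is one option. The sketch below assumes 'Date' is a datetime column. The function name `calculate_calendar_moving_average` and the gappy sample data are illustrative.

```python
import pandas as pd

def calculate_calendar_moving_average(df):
    # Time-based windows need the dates in increasing order
    out = df.sort_values('Date').copy()
    sales_by_date = out.set_index('Date')['Sales']
    # '7D' covers the current day and the 6 calendar days before it
    out['MovingAverage'] = sales_by_date.rolling('7D').mean().to_numpy()
    return out

# Sample with a gap: no rows between 2023-01-02 and 2023-01-10
gappy = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-10']),
                      'Sales': [10, 20, 60]})

result = calculate_calendar_moving_average(gappy)
print(result)
```

For the last row, only the sale on 2023-01-10 falls inside the 7-day window, so the average is 60.0, whereas a 7-row window would average all three rows.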


## Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.

In [5]:
import pandas as pd

def add_weekday_column(df):
    df['Weekday'] = df['Date'].dt.day_name()
    return df

df = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])})

df = add_weekday_column(df)

print(df)


        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday


## Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [6]:
import pandas as pd

def select_rows_by_date(df):
    start_date = '2023-01-01'
    end_date = '2023-01-31'
    
    df['Date'] = pd.to_datetime(df['Date'])
    
    selected_rows = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
    
    return selected_rows

df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-15', '2023-01-31', '2023-02-10']})

selected_df = select_rows_by_date(df)

print(selected_df)


        Date
0 2023-01-01
1 2023-01-15
2 2023-01-31
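
Two details of the function above are worth noting: it converts the 'Date' column of the caller's DataFrame in place, and the bound '2023-01-31' is interpreted as midnight, so a timestamp later on 31 January would be excluded. The sketch below is one possible variant that avoids both by using a half-open interval ending at 1 February. The function name `select_january_2023` and the sample timestamps are assumptions.

```python
import pandas as pd

def select_january_2023(df):
    # Parse into a separate Series so the input column is left untouched
    dates = pd.to_datetime(df['Date'])
    start = pd.Timestamp('2023-01-01')
    end_exclusive = pd.Timestamp('2023-02-01')
    # Half-open interval: includes every time of day on 2023-01-31
    mask = (dates >= start) & (dates < end_exclusive)
    return df.loc[mask]

stamps = pd.DataFrame({'Date': ['2022-12-31 23:59',
                                '2023-01-01 00:00',
                                '2023-01-31 18:30',
                                '2023-02-01 00:00']})

january = select_january_2023(stamps)
print(january)
```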


## Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

Ans: To use the basic functions of Pandas, the first and foremost library that needs to be imported is the Pandas library itself, conventionally with `import pandas as pd`.