## Q1. List any five functions of the pandas library with execution.

1. read_csv(): Reads a CSV file and returns a DataFrame.

2. head(): Returns the first n rows of a DataFrame (n=5 by default).

3. info(): Prints a summary of a DataFrame: the number of rows and columns, the non-null count and data type of each column, and memory usage.

4. groupby(): Groups the DataFrame by one or more columns and returns a GroupBy object, which can then be aggregated.

5. describe(): Generates descriptive statistics such as count, mean, and standard deviation (by default for the numeric columns).
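
The five functions above can be run end to end as follows. This is a minimal sketch: the CSV content, its column names (city, year, sales), and the use of io.StringIO in place of a file on disk are assumptions made so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real call would pass a file path instead.
csv_text = (
    "city,year,sales\n"
    "Pune,2022,100\n"
    "Pune,2023,150\n"
    "Delhi,2022,200\n"
    "Delhi,2023,250\n"
)

# 1. read_csv(): parse the CSV text into a DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# 2. head(): first two rows
first_rows = df.head(2)
print(first_rows)

# 3. info(): prints the summary directly and returns None
df.info()

# 4. groupby(): total sales per city
totals = df.groupby('city')['sales'].sum()
print(totals)

# 5. describe(): summary statistics of the numeric columns
stats = df.describe()
print(stats)
```
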
               

## Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [5]:
import pandas as pd
        
def reindex_dataframe(df):
    new_index = pd.RangeIndex(start=1, stop=2*len(df)+1, step=2)
    new_df = df.set_index(new_index)
    return new_df



In [7]:
# Example usage on a DataFrame with columns 'A', 'B', and 'C':
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = reindex_dataframe(df)
print(df)

   A  B  C
1  1  4  7
3  2  5  8
5  3  6  9


## Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.
For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should
calculate and print the sum of the first three values, which is 60.

In [6]:
def sum_first_three_values(df):
    total = 0
    # enumerate() supplies the row position, so the check does not depend on the index labels
    for position, (_, row) in enumerate(df.iterrows()):
        if position >= 3:
            break
        total += row['Values']
    print(total)

In [8]:

# This function iterates over the rows of df with iterrows(). enumerate() tracks each row's position, and the value in the 'Values' column is added to total for the first three rows only; the loop stops when it reaches the fourth row. Finally, it prints the total to the console.

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
sum_first_three_values(df)

60
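
The question asks for explicit iteration, but the result can be cross-checked with a position-based slice. The sketch below assumes a DataFrame with string index labels (an invented variation on the sample data) to show that iloc selects by position regardless of the labels.

```python
import pandas as pd

# Same values as above, but with non-integer index labels (an assumption for illustration)
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]}, index=['a', 'b', 'c', 'd', 'e'])

# iloc[:3] takes the first three rows by position; sum() adds them up
total = df['Values'].iloc[:3].sum()
print(total)
```
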


## Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

In [9]:
def count_words(df):
    df['Word_Count'] = df['Text'].apply(lambda x: len(x.split()))
    return df

In [10]:

df = pd.DataFrame({'Text': ['This is the first row', 'Second row contains more words', 'Third row with even more words']})
df = count_words(df)
print(df)

                             Text  Word_Count
0           This is the first row           5
1  Second row contains more words           5
2  Third row with even more words           6
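
As a possible alternative, the word count can be computed with the vectorized .str accessor instead of apply(). The sketch below assumes one row has a missing value (an invented case): the .str methods propagate it as NaN, whereas calling x.split() on a missing value inside a lambda would raise an AttributeError.

```python
import pandas as pd

# Sample data with one missing value (an assumption for illustration)
df = pd.DataFrame({'Text': ['This is the first row', 'Second row contains more words', None]})

# str.split() splits each string on whitespace; str.len() counts the resulting words
df['Word_Count'] = df['Text'].str.split().str.len()
print(df)
```

Because of the missing value, the 'Word_Count' column is stored as floats here.
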


## Q5. How are DataFrame.size() and DataFrame.shape() different?

DataFrame.size and DataFrame.shape are attributes, not methods, so they are accessed without parentheses.

DataFrame.size returns the total number of elements in the DataFrame, which is equal to the product of the number of rows and the number of columns.

DataFrame.shape returns a tuple that contains the number of rows and the number of columns in the DataFrame.

In [11]:
import pandas as pd

# Create a DataFrame with 3 rows and 4 columns
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

# Get the size of the DataFrame
size = df.size
print(size)  

# Get the shape of the DataFrame
shape = df.shape
print(shape)  

12
(3, 4)
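
Both size and shape are read as attributes. The short sketch below, using an invented two-column DataFrame, checks the relationship between them and shows that calling size with parentheses fails, because the attribute's value is an integer and is not callable.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# shape unpacks into (rows, columns); size is their product
rows, cols = df.shape
print(df.size == rows * cols)

# Calling the attribute like a method raises a TypeError
size_call_error = None
try:
    df.size()
except TypeError as exc:
    size_call_error = exc
    print("df.size() failed with", type(exc).__name__)
```
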


## Q6. Which function of pandas do we use to read an excel file?

In [None]:
import pandas as pd

# Read Excel file into a Pandas DataFrame
df = pd.read_excel('example.xlsx')

# Display the DataFrame
print(df)

In this example, the read_excel() function reads an Excel file named example.xlsx into a Pandas DataFrame df. Reading .xlsx files requires the openpyxl package to be installed.

In [None]:
import pandas as pd

# Read data from sheet named 'Sheet1' in the Excel file
df = pd.read_excel('example.xlsx', sheet_name='Sheet1')

# Display the DataFrame
print(df)

# In this example, the read_excel() function reads data from the sheet named 'Sheet1' in the Excel file. If sheet_name is omitted, the first sheet is read.

## Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address. The username is the part of the email address that appears before the '@' symbol.
For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.

In [3]:
def extract_username(df):
    df['Username'] = df['Email'].apply(lambda x: x.split('@')[0])
    return df

In [4]:
# Create a sample DataFrame
import pandas as pd
df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.doe@example.com', 'james.smith@example.com']})

# Call the function to extract the usernames
df = extract_username(df)

# Display the updated DataFrame
print(df)

                     Email     Username
0     john.doe@example.com     john.doe
1     jane.doe@example.com     jane.doe
2  james.smith@example.com  james.smith


This function uses the apply() method with a lambda function to call split() on each value in the 'Email' column. The split() method splits the email address at the '@' character and returns a list whose first element is the username. The lambda function returns that first element, and the resulting values are assigned to a new column 'Username' in the DataFrame. Note that the function adds the column to the DataFrame that was passed in and returns that same DataFrame.
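
The same extraction can also be sketched with the vectorized .str accessor. This is an alternative under the same assumption as above, namely that every value is a string containing an '@'; passing n=1 limits the split to the first '@'.

```python
import pandas as pd

df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.doe@example.com']})

# Split each address once at '@' and keep the part before it
df['Username'] = df['Email'].str.split('@', n=1).str[0]
print(df)
```
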

## Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.


For example, if df contains the following values:

   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

Your function should select the following rows (row 2 also qualifies, since 6 > 5 and 9 < 10):

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2

The function should return a new DataFrame that contains only the selected rows.

In [22]:
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})

# Define the select_rows function
def select_rows(df):
    selected_df = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_df



In [23]:
# Call the select_rows function
selected_df = select_rows(df)

# Print the selected rows
print(selected_df)

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
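
The same filter can be written with DataFrame.query(), which takes the condition as a string. This is a minimal sketch of an alternative, not a replacement for the boolean-mask version above; it reuses the sample data from the question.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})

# Column names are referenced directly inside the query string
selected = df.query('A > 5 and B < 10')
print(selected)
```

Rows 1, 2, and 4 satisfy both conditions, so the result matches the output above.
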


## Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

In [4]:
def calculate_stats(df):
    # Calculate the mean, median, and standard deviation of the 'Values' column
    mean = df['Values'].mean()
    median = df['Values'].median()
    std_dev = df['Values'].std()

    # Print the statistics to the console
    print("Mean:", mean)
    print("Median:", median)
    print("Standard deviation:", std_dev)

In [6]:
# Create a sample DataFrame
import pandas as pd
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
# Call the function to calculate the statistics
calculate_stats(df)

Mean: 30.0
Median: 30.0
Standard deviation: 15.811388300841896
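
The standard deviation printed above is the sample standard deviation, because Series.std() uses ddof=1 by default. The sketch below uses a hypothetical helper, summarize_values, that returns the statistics instead of printing them and exposes ddof so the population standard deviation (ddof=0) can be compared.

```python
import pandas as pd

def summarize_values(df, ddof=1):
    """Hypothetical helper: return mean, median, and standard deviation of 'Values'."""
    values = df['Values']
    return {
        'mean': values.mean(),
        'median': values.median(),
        'std': values.std(ddof=ddof),  # ddof=1: sample, ddof=0: population
    }

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
sample_stats = summarize_values(df)
population_stats = summarize_values(df, ddof=0)
print(sample_stats)
print(population_stats)
```
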


## Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

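You can use the rolling() method of a Pandas Series or DataFrame to calculate the mean of a column over a sliding window. An integer window of 7 counts rows, so it corresponds to 7 days only when the data has exactly one row per day, sorted by date. Setting min_periods=1 lets the first six rows average over the rows available so far instead of producing NaN.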

In [15]:
import pandas as pd

def calculate_moving_average(df):
    window_size = 7
    ma_column = 'MovingAverage'
    df[ma_column] = df['Sales'].rolling(window_size, min_periods=1).mean()
    return df

In [17]:
# Create a sample DataFrame
df = pd.DataFrame({'Sales': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                   'Date': pd.date_range(start='2022-01-01', periods=10, freq='D')})

# Call the function to calculate the moving average
df = calculate_moving_average(df)

# Print the modified DataFrame with the new 'MovingAverage' column
print(df)

   Sales       Date  MovingAverage
0     10 2022-01-01           10.0
1     20 2022-01-02           15.0
2     30 2022-01-03           20.0
3     40 2022-01-04           25.0
4     50 2022-01-05           30.0
5     60 2022-01-06           35.0
6     70 2022-01-07           40.0
7     80 2022-01-08           50.0
8     90 2022-01-09           60.0
9    100 2022-01-10           70.0
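
An integer window counts rows rather than calendar days. When some days may be missing, a time-based window is one possible alternative. The sketch below assumes the 'Date' column is of datetime type and sorted in ascending order; the sample data with a gap between 2022-01-03 and 2022-01-10 is invented, and the 'RowWindowAverage' column name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-10']),
    'Sales': [10, 20, 30, 40],
})

# Row-based window: averages the last 7 rows, whatever their dates are
df['RowWindowAverage'] = df['Sales'].rolling(7, min_periods=1).mean()

# Time-based window: averages the rows dated within the 7 days ending at each row's date
df['MovingAverage'] = df.rolling('7D', on='Date')['Sales'].mean()
print(df)
```

For the last row, the row-based average still includes the three sales from early January, while the time-based average covers only 2022-01-04 through 2022-01-10 and so uses just that row's own value.
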


## Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.

For example, if df contains the following values:
Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05
Your function should create the following DataFrame:

Date Weekday
0 2023-01-01 Sunday
1 2023-01-02 Monday
2 2023-01-03 Tuesday
3 2023-01-04 Wednesday
4 2023-01-05 Thursday
The function should return the modified DataFrame.

In [19]:
import pandas as pd

def add_weekday_column(df):
    df['Weekday'] = df['Date'].dt.strftime('%A')
    return df

df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']
})
df['Date'] = pd.to_datetime(df['Date'])
df_with_weekday = add_weekday_column(df)
print(df_with_weekday)

        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
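
A possible alternative is dt.day_name(), which returns English weekday names by default, whereas the names produced by strftime('%A') depend on the system locale. The sketch below assumes the 'Date' column has already been converted with pd.to_datetime().

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])})

# day_name() maps each date to its weekday name
df['Weekday'] = df['Date'].dt.day_name()
print(df)
```
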


## Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [None]:
import pandas as pd

def select_january_data(df):
    start_date = pd.Timestamp('2023-01-01')
    # Exclusive upper bound, so timestamps at any time on 2023-01-31 are included
    end_date = pd.Timestamp('2023-02-01')
    return df[(df['Date'] >= start_date) & (df['Date'] < end_date)]

In this function, start_date and end_date mark the bounds of the range we want to select; the 'Date' column is assumed to be of datetime type. Because the column holds timestamps, the upper bound is midnight on 2023-02-01 and is compared with <, so rows stamped at any time on 2023-01-31 are kept. Boolean indexing then selects the rows whose 'Date' falls within the range. The & operator combines the two boolean conditions, so that only rows that satisfy both are included in the output.
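
A date-range filter on timestamps is easiest to check at its boundaries. The sketch below uses a hypothetical helper, select_date_range, with a half-open interval (start inclusive, end exclusive); the sample timestamps and the 'Value' column are invented for illustration.

```python
import pandas as pd

def select_date_range(df, start, end_exclusive):
    """Hypothetical helper: keep rows with start <= Date < end_exclusive."""
    start = pd.Timestamp(start)
    end_exclusive = pd.Timestamp(end_exclusive)
    mask = (df['Date'] >= start) & (df['Date'] < end_exclusive)
    return df[mask]

df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2022-12-31 23:59',  # just before the range
        '2023-01-01 00:00',  # first instant of the range
        '2023-01-31 18:30',  # during the last day of the range
        '2023-02-01 00:00',  # first instant after the range
    ]),
    'Value': [1, 2, 3, 4],
})

january = select_date_range(df, '2023-01-01', '2023-02-01')
print(january)
```

Only the two middle rows are selected, including the one stamped during the afternoon of 2023-01-31.
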

## Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

In [None]:
import pandas as pd