# PANDAS ADVANCED ASSIGNMENT

**Q1. List any five functions of the pandas library with execution.**

Solution:

Here are five commonly used functions of the Pandas library along with their execution examples:

**1)read_csv()**: This function is used to read data from a CSV file and create a DataFrame.
Example:

In [11]:
import pandas as pd

df = pd.read_csv("C:/Users/Bharath/Downloads/taxonomy.csv.xls")  # Read the CSV data from the file and create a DataFrame




**2)info()**: This function provides a concise summary of a DataFrame, including column names, data types, and non-null value counts.
Example:

In [12]:

df.info()  # Display summary information of the DataFrame


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290 entries, 0 to 289
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   taxonomy_id  290 non-null    object
 1   name         290 non-null    object
 2   parent_id    279 non-null    object
 3   parent_name  279 non-null    object
dtypes: object(4)
memory usage: 9.2+ KB


**3)head()**: This function is used to view the first few rows of a DataFrame.
Example:

In [13]:
df.head()  # Display the first 5 rows of the DataFrame


   taxonomy_id                     name  parent_id     parent_name
0          101                Emergency        NaN             NaN
1       101-01        Disaster Response        101       Emergency
2       101-02           Emergency Cash        101       Emergency
3    101-02-01        Help Pay for Food     101-02  Emergency Cash
4    101-02-02  Help Pay for Healthcare     101-02  Emergency Cash


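**4)describe()**: This function generates descriptive statistics for a DataFrame. For numerical columns it reports the count, mean, standard deviation, minimum, quartiles, and maximum. When the DataFrame has no numerical columns, as in the example below, it instead reports the count, the number of unique values, the most frequent value (top), and that value's frequency (freq).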
Example:

In [14]:

df.describe() # Display the descriptive statistics of the DataFrame


        taxonomy_id          name  parent_id       parent_name
count           290           290        279               279
unique          290           183         60                50
top             101  Nursing Home  106-06-07  Health Education
freq              1             4         11                15


**5)tail()**: This function is used to view the last few rows of a DataFrame.
Example:

In [15]:
df.tail()  # Display the last 5 rows of the DataFrame


     taxonomy_id                          name  parent_id           parent_name
285    111-01-07              Workplace Rights     111-01  Advocacy & Legal Aid
286       111-02                     Mediation        111                 Legal
287       111-03                        Notary        111                 Legal
288       111-04                Representation        111                 Legal
289       111-05  Translation & Interpretation        111                 Legal
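
The difference between the numeric and non-numeric summaries produced by describe() can be seen on a small example. This is a minimal sketch: the `scores` DataFrame and its column names are hypothetical and are not part of the taxonomy data above.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
scores = pd.DataFrame({
    'score': [1, 2, 3, 4],
    'label': ['a', 'b', 'a', 'a'],
})

# On a mixed DataFrame, describe() summarizes only the numeric columns by default.
numeric_summary = scores.describe()

# On a text column, describe() reports count, unique, top, and freq.
text_summary = scores['label'].describe()

print(numeric_summary)
print(text_summary)
```

Here `numeric_summary` contains only the 'score' column, while `text_summary` shows that 'a' is the most frequent label.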


**Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.**

Solution:

To re-index a Pandas DataFrame with a new index that starts from 1 and increments by 2 for each row, you can use the reset_index() function along with a lambda function that maps each default position (0, 1, 2, ...) to its new index value (1, 3, 5, ...).

In [16]:
import pandas as pd

def reindex_dataframe(df):
    new_index = lambda x: (x * 2) + 1
    df.reset_index(drop=True, inplace=True)
    df.index = df.index.map(new_index)
    return df


In this function, reset_index() is used to reset the index of the DataFrame, and drop=True ensures that the old index is not added as a new column in the DataFrame. Then, map() is applied to the index values using a lambda function new_index that generates the new index values based on the original index values. Because reset_index() is called with inplace=True, the DataFrame passed to the function is itself modified, and the same object is returned.

In [17]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})

# Re-index the DataFrame
df_reindexed = reindex_dataframe(df)
print(df_reindexed)


   A  B   C
1  1  5   9
3  2  6  10
5  3  7  11
7  4  8  12


In the resulting DataFrame, the index starts from 1 and increments by 2 for each row, as desired.
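
The same re-indexing can also be sketched without a lambda and without modifying the caller's DataFrame. The helper name `reindex_dataframe_copy` is hypothetical, and the sketch assumes that returning a copy is acceptable for the assignment.

```python
import pandas as pd

def reindex_dataframe_copy(df):
    """Return a copy of df whose index is 1, 3, 5, ... (hypothetical helper)."""
    result = df.copy()
    # RangeIndex stops before `stop`, so 2 * len + 1 covers every row.
    result.index = pd.RangeIndex(start=1, stop=2 * len(result) + 1, step=2)
    return result

original = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})
reindexed = reindex_dataframe_copy(original)

print(reindexed.index.tolist())  # new index on the copy
print(original.index.tolist())   # the input keeps its default index
```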

**Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.
For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should
calculate and print the sum of the first three values, which is 60.**

Solution:

In [18]:
import pandas as pd

def calculate_sum_of_first_three(df):
    values_column = df['Values']
    first_three_values = values_column[:3]  # Select the first three values
    sum_of_first_three = sum(first_three_values)
    print("Sum of the first three values:", sum_of_first_three)


In this function, the 'Values' column is extracted from the DataFrame using df['Values']. Then, the first three values are selected by slicing with values_column[:3]. The built-in sum() function iterates over the selected values and adds them up, and the result is printed to the console.

In [19]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# Calculate and print the sum of the first three values
calculate_sum_of_first_three(df)


Sum of the first three values: 60
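
Since the question asks for a function that iterates over the DataFrame, the iteration can also be written out explicitly. This is a minimal sketch with a hypothetical function name; unlike the function above, it also returns the sum so that the result can be checked.

```python
import pandas as pd

def print_sum_of_first_three(df):
    """Loop over the first three entries of 'Values', print and return their sum."""
    total = 0
    # head(3) yields at most three values, so shorter columns are handled too.
    for value in df['Values'].head(3):
        total += value
    print("Sum of the first three values:", total)
    return total

sample = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
total = print_sum_of_first_three(sample)
```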


**Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.**

Solution:

In [20]:
import pandas as pd

def add_word_count_column(df):
    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
    return df


In this function, apply() is used to apply a lambda function to each value of the 'Text' column. The lambda function lambda x: len(str(x).split()) converts each value to a string, splits it into words on whitespace using the split() method, and returns the number of words using len(). Because str() is applied first, a missing value (NaN) becomes the string 'nan' and is counted as one word.

In [23]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'Text': ['Hello, how are you?', 'I am doing great!', 'Python is awesome.']})

df


                  Text
0  Hello, how are you?
1    I am doing great!
2   Python is awesome.


In [24]:
# Add the 'Word_Count' column
df_with_word_count = add_word_count_column(df)
print(df_with_word_count)

                  Text  Word_Count
0  Hello, how are you?           4
1    I am doing great!           4
2   Python is awesome.           3
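
The word count can also be sketched with the vectorized `.str` accessor instead of apply(). The function name `add_word_count_vectorized` is hypothetical, and treating a missing text as zero words is an assumption made for this sketch, not something the question specifies.

```python
import pandas as pd

def add_word_count_vectorized(df):
    """Return a copy of df with a 'Word_Count' column (hypothetical helper)."""
    result = df.copy()
    # str.split() splits on whitespace; str.len() counts the resulting words.
    counts = result['Text'].str.split().str.len()
    # Missing texts produce NaN here; this sketch assumes they count as 0 words.
    result['Word_Count'] = counts.fillna(0).astype(int)
    return result

texts = pd.DataFrame({'Text': ['Hello, how are you?', None, 'Python is awesome.']})
counted = add_word_count_vectorized(texts)
print(counted)
```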


**Q5. How are DataFrame.size() and DataFrame.shape() different?**

Solution:

`DataFrame.size` and `DataFrame.shape` provide different information about the DataFrame. Both are attributes rather than functions, so they are accessed without parentheses.

- `DataFrame.size`: This attribute returns the total number of elements in the DataFrame, which equals the number of rows multiplied by the number of columns.

Example:
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.size)  # Output: 6
```

In this example, the DataFrame `df` has 3 rows and 2 columns, so the total number of elements is 3 * 2 = 6. Thus, `df.size` will output 6.

- `DataFrame.shape`: This attribute returns a tuple that represents the dimensions of the DataFrame, in the form (number of rows, number of columns).

Example:
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.shape)  # Output: (3, 2)
```

In this example, the DataFrame `df` has 3 rows and 2 columns. Therefore, `df.shape` will output (3, 2), indicating that the DataFrame has 3 rows and 2 columns.

In summary, `DataFrame.size` returns the total number of elements in the DataFrame, while `DataFrame.shape` returns a tuple representing the dimensions (number of rows and columns) of the DataFrame. Both are attributes, so they are written without parentheses.
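
The relationship between the two, and the fact that both are attributes rather than callable methods, can be checked directly. This is a minimal sketch; the name `df_demo` is only illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, cols = df_demo.shape
# size is the product of the two entries of shape.
print(df_demo.size == rows * cols)

# Calling the attribute like a function fails, because its value is an integer.
try:
    df_demo.size()
except TypeError as exc:
    print("TypeError:", exc)
```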




**Q6. Which function of pandas do we use to read an excel file?**


Solution:



To read an Excel file in Pandas, you can use the `read_excel()` function. This function allows you to read data from an Excel file and create a DataFrame.

Example:
```python
import pandas as pd

df = pd.read_excel('data.xlsx')  # Read data from 'data.xlsx' and create a DataFrame
print(df.head())
```

In this example, `read_excel()` is used to read the data from the 'data.xlsx' Excel file and create a DataFrame. The resulting DataFrame is then printed using `df.head()` to display the first few rows of the DataFrame.

Note: For `.xlsx` files, `read_excel()` requires the `openpyxl` library to be installed. You can install it using `pip install openpyxl` if it is not already installed.

**Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.
The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.**

Solution:

In [25]:
import pandas as pd

def extract_username(df):
    df['Username'] = df['Email'].str.split('@').str.get(0)
    return df


In this function, str.split('@') is used to split each email address in the 'Email' column at the '@' symbol, creating a list of strings. Then, str.get(0) is used to extract the first element (the username) from the list. This extracted username is assigned to the new 'Username' column of the DataFrame.

In [26]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.smith@example.com', 'bob@gmail.com']})

df

                    Email
0    john.doe@example.com
1  jane.smith@example.com
2           bob@gmail.com


In [27]:
# Extract the username and add the 'Username' column
df_with_username = extract_username(df)
print(df_with_username)


                    Email    Username
0    john.doe@example.com    john.doe
1  jane.smith@example.com  jane.smith
2           bob@gmail.com         bob
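
If the domain is needed as well, the split can be expanded into separate columns in one step. This is a sketch under the assumption that every address contains an '@' symbol; the function name and the 'Domain' column are hypothetical additions that the question does not ask for.

```python
import pandas as pd

def split_email(df):
    """Return a copy of df with 'Username' and 'Domain' columns (hypothetical helper)."""
    result = df.copy()
    # n=1 splits at the first '@' only; expand=True returns the parts as columns 0 and 1.
    parts = result['Email'].str.split('@', n=1, expand=True)
    result['Username'] = parts[0]
    result['Domain'] = parts[1]
    return result

emails = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.smith@example.com', 'bob@gmail.com']})
split_df = split_email(emails)
print(split_df)
```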


**Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.
For example, if df contains the following values:
A B C
0 3 5 1
1 8 2 7
2 6 9 4
3 2 3 5
4 9 1 2
Your function should select the following rows: A B C
1 8 2 7
4 9 1 2
The function should return a new DataFrame that contains only the selected rows.**

Solution:

In [28]:
import pandas as pd

def select_rows(df):
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows


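In this function, (df['A'] > 5) & (df['B'] < 10) creates a boolean mask that checks if the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. This mask is used to filter the DataFrame df, selecting only the rows that satisfy both conditions. Note that with the example data, row 2 (A = 6, B = 9) also satisfies both conditions, so the result contains rows 1, 2, and 4 rather than only the two rows listed in the question.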

In [29]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})

df

   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2


In [30]:
# Select rows based on conditions
selected_df = select_rows(df)
print(selected_df)


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
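
The same filter can be sketched with DataFrame.query(), which reads closer to the wording of the question. The function name and the threshold parameters `a_min` and `b_max` are hypothetical; inside the query string, the `@` prefix refers to local Python variables.

```python
import pandas as pd

def select_rows_query(df, a_min=5, b_max=10):
    """Return rows where A > a_min and B < b_max (hypothetical helper)."""
    return df.query('A > @a_min and B < @b_max')

data = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})
selected = select_rows_query(data)
print(selected)
```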


**Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean,
median, and standard deviation of the values in the 'Values' column.**

Solution:

In [31]:
import pandas as pd

def calculate_statistics(df):
    values_column = df['Values']
    mean_value = values_column.mean()
    median_value = values_column.median()
    std_value = values_column.std()
    
    return mean_value, median_value, std_value


In this function, df['Values'] extracts the 'Values' column from the DataFrame. The mean(), median(), and std() methods are then called on that column to calculate the mean, median, and standard deviation, respectively. Note that std() returns the sample standard deviation (ddof=1) by default.

In [32]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

df


   Values
0      10
1      20
2      30
3      40
4      50


In [33]:
# Calculate statistics
mean, median, std = calculate_statistics(df)

print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std)

Mean: 30.0
Median: 30.0
Standard Deviation: 15.811388300841896
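
The three statistics can also be computed in a single call with agg(). The sketch below additionally shows the population standard deviation (ddof=0) for comparison with the default sample standard deviation; the variable names are illustrative.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50], name='Values')

# agg() with a list of names returns a Series indexed by those names.
summary = values.agg(['mean', 'median', 'std'])

# std() divides by n - 1 by default; ddof=0 divides by n instead.
population_std = values.std(ddof=0)

print(summary)
print("Population standard deviation:", population_std)
```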


**Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to
create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days
for each row in the DataFrame. The moving average should be calculated using a window of size 7 and
should include the current day.**

Solution:

In [34]:
import pandas as pd

def calculate_moving_average(df):
    window_size = 7
    df['MovingAverage'] = df['Sales'].rolling(window=window_size, min_periods=1).mean()
    return df


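In this function, df['Sales'].rolling(window=window_size, min_periods=1) creates a rolling window over the current row and the six rows before it, and mean() averages the values in that window. The min_periods=1 argument allows the average to be computed even when fewer than 7 values are available, so the first six rows are not NaN. The window counts rows rather than calendar days, so this approach assumes the DataFrame has one row per day, sorted by 'Date'.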

In [35]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=10),
    'Sales': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
})

df


        Date  Sales
0 2023-01-01     10
1 2023-01-02     15
2 2023-01-03     20
3 2023-01-04     25
4 2023-01-05     30
5 2023-01-06     35
6 2023-01-07     40
7 2023-01-08     45
8 2023-01-09     50
9 2023-01-10     55


In [36]:
# Calculate moving average
df_with_ma = calculate_moving_average(df)
print(df_with_ma)

        Date  Sales  MovingAverage
0 2023-01-01     10           10.0
1 2023-01-02     15           12.5
2 2023-01-03     20           15.0
3 2023-01-04     25           17.5
4 2023-01-05     30           20.0
5 2023-01-06     35           22.5
6 2023-01-07     40           25.0
7 2023-01-08     45           30.0
8 2023-01-09     50           35.0
9 2023-01-10     55           40.0
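
When some days are missing from the data, a row-based window of 7 no longer corresponds to 7 calendar days. One possible alternative is a time-based window, sketched below under the assumption that 'Date' holds datetime values; the function name and the sample data with a gap are hypothetical.

```python
import pandas as pd

def calculate_moving_average_by_date(df):
    """Return a copy of df with a 7-calendar-day moving average (hypothetical helper)."""
    result = df.sort_values('Date').copy()
    # '7D' covers the current timestamp and the 7 days before it (start excluded),
    # so for daily data the window holds the current day and the six previous days.
    rolled = result.set_index('Date')['Sales'].rolling('7D').mean()
    result['MovingAverage'] = rolled.to_numpy()
    return result

# Hypothetical data with a gap between 2023-01-02 and 2023-01-10.
gappy = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-10']),
    'Sales': [10, 20, 30],
})
by_date = calculate_moving_average_by_date(gappy)
print(by_date)
```

For the last row, only the sale on 2023-01-10 falls inside the 7-day window, so its moving average is 30.0, whereas a row-based window of 7 would average all three rows.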


**Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g.
Monday, Tuesday) corresponding to each date in the 'Date' column.
For example, if df contains the following values:
Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05
Your function should create the following DataFrame:**

**Date Weekday
0 2023-01-01 Sunday
1 2023-01-02 Monday
2 2023-01-03 Tuesday
3 2023-01-04 Wednesday
4 2023-01-05 Thursday**

**The function should return the modified DataFrame.**

Solution:


In [37]:
import pandas as pd

def add_weekday_column(df):
    df['Weekday'] = df['Date'].dt.strftime('%A')
    return df


In this function, df['Date'].dt.strftime('%A') extracts the weekday name from each date in the 'Date' column using the %A format code. This format code represents the full weekday name.

In [38]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])})

df


        Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05


In [39]:
# Add weekday column
df_with_weekday = add_weekday_column(df)
print(df_with_weekday)

        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
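
pandas also provides dt.day_name(), which returns English weekday names by default, and dt.dayofweek, which returns the weekday as a number. The sketch below shows both on a small hypothetical DataFrame; the 'WeekdayNumber' column is an illustrative extra that the question does not ask for.

```python
import pandas as pd

dates = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])})

# day_name() gives the full weekday name for each date.
dates['Weekday'] = dates['Date'].dt.day_name()

# dayofweek numbers the days from Monday=0 to Sunday=6.
dates['WeekdayNumber'] = dates['Date'].dt.dayofweek

print(dates)
```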


**Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.**

Solution:

In [40]:
import pandas as pd

def select_rows_in_date_range(df):
    start_date = '2023-01-01'
    end_date = '2023-01-31'
    mask = (df['Date'].between(start_date, end_date))
    selected_rows = df[mask]
    return selected_rows


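In this function, df['Date'].between(start_date, end_date) creates a boolean mask indicating whether each date in the 'Date' column lies between the start date and end date, with both endpoints included. Because the end date '2023-01-31' is interpreted as midnight at the start of that day, timestamps later on January 31 are not matched.

The resulting mask is then used to filter the DataFrame df, selecting only the rows that fall within the specified date range. In the example below, all ten dates fall in January 2023, so every row is selected.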

In [41]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'Date': pd.date_range('2023-01-01', periods=10)})

df


        Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05
5 2023-01-06
6 2023-01-07
7 2023-01-08
8 2023-01-09
9 2023-01-10


In [42]:
# Select rows in date range
selected_df = select_rows_in_date_range(df)
print(selected_df)

        Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05
5 2023-01-06
6 2023-01-07
7 2023-01-08
8 2023-01-09
9 2023-01-10
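
When the 'Date' column contains times of day, one way to include all of the last day is a half-open range that ends just before the following day. This is a sketch with a hypothetical function name, parameters, and sample data; it is compared with between() on the same data.

```python
import pandas as pd

def select_rows_between(df, start, end):
    """Return rows with start <= Date < (end + 1 day) (hypothetical helper)."""
    start_ts = pd.Timestamp(start)
    # Half-open upper bound: everything strictly before the day after `end`.
    end_exclusive = pd.Timestamp(end) + pd.Timedelta(days=1)
    mask = (df['Date'] >= start_ts) & (df['Date'] < end_exclusive)
    return df[mask]

# Hypothetical timestamps around the edges of January 2023.
events = pd.DataFrame({'Date': pd.to_datetime([
    '2022-12-31 23:00', '2023-01-01 00:00', '2023-01-31 18:30', '2023-02-01 00:00',
])})

january = select_rows_between(events, '2023-01-01', '2023-01-31')

# between() with a midnight end bound misses the 18:30 timestamp on January 31.
between_only = events[events['Date'].between(pd.Timestamp('2023-01-01'), pd.Timestamp('2023-01-31'))]

print(january)
print(between_only)
```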


**Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to
be imported?**

Solution:

The first and foremost necessary library that needs to be imported to use the basic functions of pandas is the pandas library itself. The import statement for pandas is typically written as:

```python
import pandas as pd
```

Importing pandas under the alias `pd` is a common convention. Using `pd` when referring to pandas functions and objects keeps the code concise and readable.

Once the pandas library is imported, you can then use various functions and objects provided by pandas for data manipulation and analysis.

# ------------------------------------------------------------END---------------------------------------------