## Series and DataFrame: Understand the fundamental data structures in Pandas

### Exercise 1: Series

Create a Pandas Series to represent the accuracy scores of a machine learning model for different iterations. Find the maximum and minimum accuracy scores.

In [1]:
import pandas as pd

# Data represents accuracy scores
data = [0.82, 0.76, 0.89, 0.93, 0.81]

# TODO: Create a Series to represent accuracy scores from data
accuracy_scores = pd.Series(data)

# TODO: Find the maximum accuracy score
max_accuracy = accuracy_scores.max()

# TODO: Find the minimum accuracy score
min_accuracy = accuracy_scores.min()

# Print the maximum and minimum accuracy scores
print("Maximum Accuracy:", max_accuracy)
print("Minimum Accuracy:", min_accuracy)

Maximum Accuracy: 0.93
Minimum Accuracy: 0.76
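
Knowing the best and worst scores is often only half the answer; it usually helps to know which iteration produced them. The sketch below assumes hypothetical iteration labels (`iter_1` to `iter_5`, not part of the exercise) as the Series index and uses `idxmax()` and `idxmin()` to look up the matching labels.

```python
import pandas as pd

# Same accuracy scores as in the exercise; the iteration labels are hypothetical
scores = pd.Series(
    [0.82, 0.76, 0.89, 0.93, 0.81],
    index=["iter_1", "iter_2", "iter_3", "iter_4", "iter_5"],
)

best_iteration = scores.idxmax()   # index label of the highest score
worst_iteration = scores.idxmin()  # index label of the lowest score

print("Best iteration:", best_iteration, "with accuracy", scores[best_iteration])
print("Worst iteration:", worst_iteration, "with accuracy", scores[worst_iteration])
```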


### Exercise 2: DataFrame

Create a Pandas DataFrame to store information about the performance metrics (precision, recall, and F1-score) of a classification model for different classes. Calculate the average precision across all classes.

In [2]:
# The data represents performance metrics
data = {'Class': ['Class A', 'Class B', 'Class C'],
        'Precision': [0.82, 0.76, 0.89],
        'Recall': [0.72, 0.84, 0.91],
        'F1-score': [0.76, 0.80, 0.87]}

# TODO: Create a DataFrame to represent performance metrics
performance_df = pd.DataFrame(data)

# TODO: Calculate the average precision across all classes using the .mean() method
# (grouping by 'Class' would only return each class's own precision, since every class has a single row)
average_precision = performance_df['Precision'].mean()

# Print the average precision
print(f"Average Precision: {average_precision:.4f}")

Average Precision: 0.8233
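
To get one average per metric rather than one value per class, `mean()` can be applied to several columns at once. This is a minimal sketch that rebuilds the exercise's DataFrame; the variable name `metric_means` is only illustrative.

```python
import pandas as pd

performance_df = pd.DataFrame({
    "Class": ["Class A", "Class B", "Class C"],
    "Precision": [0.82, 0.76, 0.89],
    "Recall": [0.72, 0.84, 0.91],
    "F1-score": [0.76, 0.80, 0.87],
})

# Select only the numeric metric columns, then average each column over all classes
metric_means = performance_df[["Precision", "Recall", "F1-score"]].mean()

# The result is a Series indexed by metric name
print(metric_means.round(4))
```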


### Exercise 3: DataFrame

Create a Pandas DataFrame to represent the sales data of different products for different regions. Calculate the total sales for each region and find the region with the highest total sales.

In [3]:
# The data represents some sales information
data = {'Region': ['Region A', 'Region B', 'Region C', 'Region A', 'Region C'],
        'Product': ['Product 1', 'Product 2', 'Product 1', 'Product 2', 'Product 3'],
        'Sales': [100, 200, 150, 120, 180]}

# TODO: Create a DataFrame to represent sales data
sales_df = pd.DataFrame(data)

# TODO: Calculate the total sales for each region by using the .groupby() and .sum() methods
total_sales = sales_df.groupby('Region')['Sales'].sum()

# TODO: Find the region with the highest total sales using the .idxmax() method
highest_sales_region = total_sales.idxmax()

# Print the total sales for each region and the region with the highest sales
print("Total Sales by Region:")
print(total_sales)
print("Region with Highest Sales:", highest_sales_region)

Total Sales by Region:
Region
Region A    220
Region B    200
Region C    330
Name: Sales, dtype: int64
Region with Highest Sales: Region C
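
`idxmax()` returns only the single top region. When a full ranking is more useful, the grouped totals can be sorted or truncated instead. The sketch below reuses the exercise data; `ranked` and `top_two` are illustrative names.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Region": ["Region A", "Region B", "Region C", "Region A", "Region C"],
    "Product": ["Product 1", "Product 2", "Product 1", "Product 2", "Product 3"],
    "Sales": [100, 200, 150, 120, 180],
})

total_sales = sales_df.groupby("Region")["Sales"].sum()

# Rank all regions from highest to lowest total sales
ranked = total_sales.sort_values(ascending=False)

# Keep only the two best-selling regions
top_two = total_sales.nlargest(2)

print(ranked)
print(top_two)
```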


### Exercise 4: Exploring Data

Given a synthetic DataFrame `data`, explore the data by displaying the first few rows, summary statistics, and column information.

In [4]:
# Synthetic DataFrame
data = pd.DataFrame({
    'Feature1': [5, 7, 6, 3, 1, 2, 8, 6, 7, 5, 5, 8, 1, 2, 3, 4, 5, 9, 1, 3],
    'Feature2': ['E', 'A', 'C', 'G', 'K', 'E', 'A', 'E', 'G', 'L', 'E', 'A', 'D', 'G', 'K', 'F', 'A', 'D', 'E', 'K'],
    'Feature3': [True, False, True, True, False, True, False, True, False, True, False, True, True, False, True, True, False, True, False, True]
})

# TODO: Display the first few rows of the data. Use the head() method
print("First few rows of the data:")
print(data.head())

# TODO: Display the 11th and 12th rows of the data
print("11th and 12th rows of the data:")
print(data.iloc[10:12])

# TODO: Display summary statistics of the data. Use the describe() method
print("Summary statistics of the data:")
print(data.describe())

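# TODO: Display column information. Use the info() method
# Note: info() prints its report itself and returns None, so wrapping it in print() also prints a trailing "None"
print("Column information:")
print(data.info())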

First few rows of the data:
   Feature1 Feature2  Feature3
0         5        E      True
1         7        A     False
2         6        C      True
3         3        G      True
4         1        K     False
11th and 12th rows of the data:
    Feature1 Feature2  Feature3
10         5        E     False
11         8        A      True
Summary statistics of the data:
        Feature1
count  20.000000
mean    4.550000
std     2.502104
min     1.000000
25%     2.750000
50%     5.000000
75%     6.250000
max     9.000000
Column information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Feature1  20 non-null     int64 
 1   Feature2  20 non-null     object
 2   Feature3  20 non-null     bool  
dtypes: bool(1), int64(1), object(1)
memory usage: 468.0+ bytes
None
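
By default, `describe()` summarizes only numeric columns, which is why the summary above covers `Feature1` alone. As a hedged sketch of two ways to look at the other columns, `describe(include="all")` adds count, unique, top, and freq rows for non-numeric columns, and `value_counts()` tallies the values of a single column.

```python
import pandas as pd

data = pd.DataFrame({
    "Feature1": [5, 7, 6, 3, 1, 2, 8, 6, 7, 5, 5, 8, 1, 2, 3, 4, 5, 9, 1, 3],
    "Feature2": ["E", "A", "C", "G", "K", "E", "A", "E", "G", "L",
                 "E", "A", "D", "G", "K", "F", "A", "D", "E", "K"],
    "Feature3": [True, False, True, True, False, True, False, True, False, True,
                 False, True, True, False, True, True, False, True, False, True],
})

# Summary covering numeric and non-numeric columns; cells that do not apply are NaN
full_summary = data.describe(include="all")
print(full_summary)

# Frequency of each category in Feature2, most common first
feature2_counts = data["Feature2"].value_counts()
print(feature2_counts)
```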


### Exercise 5: Sorting Data

Given the synthetic DataFrame `data` from Exercise 4, sort it by the 'Feature1' column in ascending order and by the 'Feature2' column in descending order.

In [5]:
# TODO: Sort the data based on 'Feature1' column in ascending order. Use the sort_values() method
sorted_data1 = data.sort_values(by='Feature1', ascending=True)

# TODO: Sort the data based on 'Feature2' column in descending order. Use the sort_values() method
sorted_data2 = data.sort_values(by='Feature2', ascending=False)

# Display the sorted data
print("Sorted Data based on 'Feature1':")
print(sorted_data1)

print("Sorted Data based on 'Feature2':")
print(sorted_data2)

Sorted Data based on 'Feature1':
    Feature1 Feature2  Feature3
4          1        K     False
12         1        D      True
18         1        E     False
5          2        E      True
13         2        G     False
19         3        K      True
3          3        G      True
14         3        K      True
15         4        F      True
16         5        A     False
0          5        E      True
10         5        E     False
9          5        L      True
7          6        E      True
2          6        C      True
8          7        G     False
1          7        A     False
11         8        A      True
6          8        A     False
17         9        D      True
Sorted Data based on 'Feature2':
    Feature1 Feature2  Feature3
9          5        L      True
19         3        K      True
4          1        K     False
14         3        K      True
3          3        G      True
13         2        G     False
8          7        G     False
15    
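
`sort_values()` also accepts a list of columns, which makes the order of tied rows explicit. The following sketch sorts the same data by `Feature1` ascending and breaks ties by `Feature2` descending; `sorted_multi` is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({
    "Feature1": [5, 7, 6, 3, 1, 2, 8, 6, 7, 5, 5, 8, 1, 2, 3, 4, 5, 9, 1, 3],
    "Feature2": ["E", "A", "C", "G", "K", "E", "A", "E", "G", "L",
                 "E", "A", "D", "G", "K", "F", "A", "D", "E", "K"],
    "Feature3": [True, False, True, True, False, True, False, True, False, True,
                 False, True, True, False, True, True, False, True, False, True],
})

# Primary key: Feature1 ascending; secondary key: Feature2 descending
sorted_multi = data.sort_values(by=["Feature1", "Feature2"], ascending=[True, False])

# The original row labels are kept; reset_index(drop=True) renumbers them from 0
renumbered = sorted_multi.reset_index(drop=True)

print(sorted_multi.head())
```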

## Data Manipulation: Learn how to manipulate and transform data using Pandas functions

### Exercise 6: Filtering data

Create a synthetic DataFrame `research_data` containing information about AI research articles. Filter the data to select articles published before a given year and articles written by a specific author.

In [6]:
# Synthetic DataFrame with AI research data
research_data = pd.DataFrame({
    'Title': ['AI in Healthcare', 'Deep Learning Techniques', 'Natural Language Processing', 'Computer Vision', 'Robots Taking Over the World', 'The Quantum Computing Revolution'],
    'Year': [2020, 2021, 2019, 2022, 2018, 2023],
    'Authors': ['Ahmed Al-Farabi', 'Fatima Khalid', 'Mohammed Ali', 'Noura Abdullah', 'Abdul Rahman', 'Sara Ahmed']
})

# TODO: Filter the data to select articles published before 2021
filtered_data1 = research_data[research_data['Year'] < 2021]

# TODO: Filter the data to select articles published by 'Mohammed Ali'
filtered_data2 = research_data[research_data['Authors'] == 'Mohammed Ali']

# Print the filtered data
print("Filtered Data (Articles published before 2021):")
print(filtered_data1)

print("Filtered Data (Articles published by 'Mohammed Ali'):")
print(filtered_data2)

Filtered Data (Articles published before 2021):
                          Title  Year          Authors
0              AI in Healthcare  2020  Ahmed Al-Farabi
2   Natural Language Processing  2019     Mohammed Ali
4  Robots Taking Over the World  2018     Abdul Rahman
Filtered Data (Articles published by 'Mohammed Ali'):
                         Title  Year       Authors
2  Natural Language Processing  2019  Mohammed Ali
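
Boolean filters can be combined or expressed with helper methods. The sketch below shows three common variants on the same data: `&` with parenthesized conditions, `between()` for an inclusive range, and `isin()` for membership in a list. The year range and author list are arbitrary choices for illustration.

```python
import pandas as pd

research_data = pd.DataFrame({
    "Title": ["AI in Healthcare", "Deep Learning Techniques", "Natural Language Processing",
              "Computer Vision", "Robots Taking Over the World", "The Quantum Computing Revolution"],
    "Year": [2020, 2021, 2019, 2022, 2018, 2023],
    "Authors": ["Ahmed Al-Farabi", "Fatima Khalid", "Mohammed Ali",
                "Noura Abdullah", "Abdul Rahman", "Sara Ahmed"],
})

# Combine conditions with & (each condition needs its own parentheses)
in_range = research_data[(research_data["Year"] >= 2019) & (research_data["Year"] <= 2021)]

# between() expresses the same inclusive range more compactly
in_range_between = research_data[research_data["Year"].between(2019, 2021)]

# isin() keeps rows whose value appears in the given list
selected_authors = research_data[research_data["Authors"].isin(["Mohammed Ali", "Sara Ahmed"])]

print(in_range)
print(selected_authors)
```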


### Exercise 7: Transforming AI Survey Data

Create a synthetic DataFrame `survey_data` containing survey responses on AI adoption. Calculate the average rating for each AI application category and create a new DataFrame to store the calculated values.

In [7]:
# Synthetic DataFrame with AI survey data
survey_data = pd.DataFrame({
    'AI Application': ['Image Recognition', 
                       'Natural Language Processing', 
                       'Speech Recognition', 
                       'Autonomous Vehicles', 
                       'Recommendation Systems', 
                       'Virtual Assistants'],
    'Rating': [4.2, 3.8, 4.5, 4.1, 4.3, 3.9]
})


# TODO: Calculate the average rating for each AI application category
average_ratings = survey_data.groupby('AI Application')['Rating'].mean()

# TODO: Create a new DataFrame to store the calculated average ratings
average_ratings_df = pd.DataFrame({
    'AI Application': average_ratings.index,
    'Average Rating': average_ratings.values
})

# Print the new DataFrame
print("Average Ratings:")
print(average_ratings_df)

Average Ratings:
                AI Application  Average Rating
0          Autonomous Vehicles             4.1
1            Image Recognition             4.2
2  Natural Language Processing             3.8
3       Recommendation Systems             4.3
4           Speech Recognition             4.5
5           Virtual Assistants             3.9
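
Because each application appears only once in the exercise data, every group mean equals the single rating. The sketch below uses hypothetical responses with repeated categories so that averaging has a visible effect, and passes `as_index=False` so that `groupby()` returns a DataFrame directly instead of requiring one to be built by hand.

```python
import pandas as pd

# Hypothetical survey responses: each application is rated twice
survey = pd.DataFrame({
    "AI Application": ["Image Recognition", "Image Recognition",
                       "Virtual Assistants", "Virtual Assistants"],
    "Rating": [4.0, 4.4, 3.5, 4.1],
})

# as_index=False keeps 'AI Application' as a regular column in the result
average_ratings_df = (
    survey.groupby("AI Application", as_index=False)["Rating"]
    .mean()
    .rename(columns={"Rating": "Average Rating"})
)

print(average_ratings_df)
```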


### Exercise 8: Aggregating AI Sales Data

Create a synthetic DataFrame `sales_data` containing information about AI product sales. Calculate the total and the average sales amount for each product category and display the results.

In [8]:
# Synthetic DataFrame with AI sales data
sales_data = pd.DataFrame({
    'Product Category': ['Hardware', 'Software', 'Services', 'Hardware', 'Software'] * 200,
    'Sales Amount': [2500, 2100, 3900, 200, 2300] * 200
})

# Calculate the total sales amount for each product category
sales_totals = sales_data.groupby('Product Category')['Sales Amount'].sum()

# Calculate the average sales amount for each product category
sales_average = sales_data.groupby('Product Category')['Sales Amount'].mean()

# Print the sales totals
print("Sales Totals:")
print(sales_totals)

# Print the average sales amounts
print("Average Sales Amounts:")
print(sales_average)

Sales Totals:
Product Category
Hardware    540000
Services    780000
Software    880000
Name: Sales Amount, dtype: int64
Average Sales Amounts:
Product Category
Hardware    1350.0
Services    3900.0
Software    2200.0
Name: Sales Amount, dtype: float64
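
The two separate `groupby()` calls can be collapsed into one with `agg()`, which accepts a list of aggregation names and returns one column per aggregation. A minimal sketch on the same data follows; `summary` is an illustrative name.

```python
import pandas as pd

sales_data = pd.DataFrame({
    "Product Category": ["Hardware", "Software", "Services", "Hardware", "Software"] * 200,
    "Sales Amount": [2500, 2100, 3900, 200, 2300] * 200,
})

# One pass over the groups, several aggregations at once
summary = sales_data.groupby("Product Category")["Sales Amount"].agg(["sum", "mean", "count"])

print(summary)
```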


## Data Input and Output: Learn how to read/write data from/to different file formats

### Exercise 9: Writing Data to CSV File

Create a synthetic DataFrame `predictions` with IDs and boolean labels, and write it to CSV files twice: once with the default settings, which include the index column, and once without the index column.

In [9]:
# Synthetic DataFrame with predictions
predictions = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Label': [True, False, True, True, False]
})

# Write the predictions to a CSV file (default approach: the index is written as the first column)
predictions.to_csv('predictions_default.csv')

# Write the predictions to a CSV file without the index column
predictions.to_csv('predictions_without_index.csv', index=False)
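
The effect of the `index` argument is easiest to see when the files are read back. The sketch below writes both variants to a temporary directory (the file names `with_index.csv` and `no_index.csv` are hypothetical) and loads them with `pd.read_csv()`. A file written with the index gains an extra unnamed column unless `index_col=0` is passed when reading.

```python
import os
import tempfile

import pandas as pd

predictions = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Label": [True, False, True, True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    no_index_path = os.path.join(tmp_dir, "no_index.csv")

    predictions.to_csv(with_index_path)               # default: index is written
    predictions.to_csv(no_index_path, index=False)    # index is left out

    naive = pd.read_csv(with_index_path)                 # index shows up as 'Unnamed: 0'
    restored = pd.read_csv(with_index_path, index_col=0) # first column becomes the index again
    clean = pd.read_csv(no_index_path)

print(naive.columns.tolist())
print(restored.columns.tolist())
print(clean.columns.tolist())
```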

### Exercise 10: Writing and Reading Data from JSON File

Using the synthetic DataFrame `predictions` with IDs and boolean labels, save the data to a JSON file named "predictions.json", and then load the same data from the JSON file.

In [10]:
# TODO: Save the predictions to a JSON file
predictions.to_json('predictions.json')

# TODO: Load the data from the JSON file
data = pd.read_json('predictions.json')

# Display the data
print(data)

   ID  Label
0   1   True
1   2  False
2   3   True
3   4   True
4   5  False
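
By default, `to_json()` writes a DataFrame column by column. Many tools expect a list of row objects instead, which `orient="records"` produces. The sketch below works on an in-memory string rather than a file: `to_json()` returns the JSON text when no path is given, and `io.StringIO` lets `pd.read_json()` read it back.

```python
import io
import json

import pandas as pd

predictions = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Label": [True, False, True, True, False],
})

# With no path argument, to_json() returns the JSON text as a string
json_text = predictions.to_json(orient="records")

# Each row becomes one JSON object
records = json.loads(json_text)
print(records[0])

# Round trip: read the same text back into a DataFrame
restored = pd.read_json(io.StringIO(json_text), orient="records")
print(restored)
```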
