# Data Science Fundamentals and Importing Semi-Structured Data Lab

**Description:**

In this lab, you will explore fundamental data science concepts, including Python syntax, loops, list comprehensions, working with nested data structures (dictionaries and lists), and importing data from CSV files. You will also create a CSV file as part of the exercise.


# Part 1: Python Syntax and Basic Variables

**1.1. Printing a Message**

```python
print("Task 1: Creating a List of Squares")
```


## 1. Generate a list of the squares of the first 10 positive integers using a list comprehension.

In [40]:
# Square each integer from 1 to 10 with a list comprehension
squares = [i**2 for i in range(1, 11)]
print(squares)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


In [41]:
# Create a list of the first 20 positive integers.
numbers = list(range(1, 21))
print(numbers)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]


## 2. Use a list comprehension to filter and create a new list containing only the even numbers from the list of numbers.

In [42]:
# Keep only the numbers that are divisible by 2
even_numbers = [num for num in numbers if num % 2 == 0]
print(even_numbers)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
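
Both list tasks in this part follow the same pattern. As a minimal sketch (the names `squares_loop`, `squares_comp`, and `even_squares` are illustrative and not part of the lab), a list comprehension can be read as a compact form of a loop that appends to a list, and it can filter and transform in a single expression:

```python
# Loop form: build the list one append at a time
squares_loop = []
for i in range(1, 11):
    squares_loop.append(i**2)

# Comprehension form: the same result in one expression
squares_comp = [i**2 for i in range(1, 11)]

# A comprehension can filter and transform in one pass:
# squares of the even numbers from 1 to 10
even_squares = [i**2 for i in range(1, 11) if i % 2 == 0]

print(squares_loop == squares_comp)
print(even_squares)
```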


# Part 2: Nested Lists of Dictionaries (Semi-Structured Data)

## You have been provided with a dataset that contains sales information for a small business. The dataset is represented as a list of dictionaries, where each dictionary is one sales transaction with the following keys: "date", "product", "quantity", and "revenue".

In [43]:
sales_data = [
    {"date": "2023-01-01", "product": "Product A", "quantity": 5, "revenue": 100.00},
    {"date": "2023-01-01", "product": "Product B", "quantity": 3, "revenue": 60.00},
    {"date": "2023-01-02", "product": "Product A", "quantity": 2, "revenue": 40.00},
    {"date": "2023-01-02", "product": "Product C", "quantity": 1, "revenue": 25.00},
    {"date": "2023-01-03", "product": "Product B", "quantity": 4, "revenue": 80.00},
    {"date": "2023-01-03", "product": "Product C", "quantity": 2, "revenue": 40.00},
    {"date": "2023-01-04", "product": "Product A", "quantity": 3, "revenue": 60.00},
    {"date": "2023-01-04", "product": "Product B", "quantity": 2, "revenue": 40.00},
    {"date": "2023-01-05", "product": "Product A", "quantity": 2, "revenue": 40.00},
    {"date": "2023-01-05", "product": "Product C", "quantity": 1, "revenue": 20.00},
]
sales_data

[{'date': '2023-01-01',
  'product': 'Product A',
  'quantity': 5,
  'revenue': 100.0},
 {'date': '2023-01-01',
  'product': 'Product B',
  'quantity': 3,
  'revenue': 60.0},
 {'date': '2023-01-02',
  'product': 'Product A',
  'quantity': 2,
  'revenue': 40.0},
 {'date': '2023-01-02',
  'product': 'Product C',
  'quantity': 1,
  'revenue': 25.0},
 {'date': '2023-01-03',
  'product': 'Product B',
  'quantity': 4,
  'revenue': 80.0},
 {'date': '2023-01-03',
  'product': 'Product C',
  'quantity': 2,
  'revenue': 40.0},
 {'date': '2023-01-04',
  'product': 'Product A',
  'quantity': 3,
  'revenue': 60.0},
 {'date': '2023-01-04',
  'product': 'Product B',
  'quantity': 2,
  'revenue': 40.0},
 {'date': '2023-01-05',
  'product': 'Product A',
  'quantity': 2,
  'revenue': 40.0},
 {'date': '2023-01-05',
  'product': 'Product C',
  'quantity': 1,
  'revenue': 20.0}]

## 3. Create a dictionary that summarizes the total quantity and revenue for each product across all transactions. The dictionary should have the product name as the key and a dictionary containing the total quantity and revenue as the value.

In [44]:
product_summary = {}

# Solution 
for transaction in sales_data:
    product = transaction["product"]
    quantity = transaction["quantity"]
    revenue = transaction["revenue"]
    
    if product in product_summary:
        product_summary[product]["total_quantity"] += quantity
        product_summary[product]["total_revenue"] += revenue
    else:
        product_summary[product] = {"total_quantity": quantity, "total_revenue": revenue}

print("Product Summary:")
print(product_summary)
print()

Product Summary:
{'Product A': {'total_quantity': 12, 'total_revenue': 240.0}, 'Product B': {'total_quantity': 9, 'total_revenue': 180.0}, 'Product C': {'total_quantity': 4, 'total_revenue': 85.0}}
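
The same accumulation can be written without the `if`/`else` branch. The sketch below is one possible alternative, not the lab's required solution: it uses `collections.defaultdict` so that each new product starts from zeroed totals. The three-row `sample_sales` list is an illustrative subset in the same shape as `sales_data`.

```python
from collections import defaultdict

# A small sample in the same shape as sales_data (illustrative subset)
sample_sales = [
    {"product": "Product A", "quantity": 5, "revenue": 100.00},
    {"product": "Product B", "quantity": 3, "revenue": 60.00},
    {"product": "Product A", "quantity": 2, "revenue": 40.00},
]

# Each missing product starts with zeroed totals, so no if/else is needed
summary = defaultdict(lambda: {"total_quantity": 0, "total_revenue": 0.0})
for transaction in sample_sales:
    totals = summary[transaction["product"]]
    totals["total_quantity"] += transaction["quantity"]
    totals["total_revenue"] += transaction["revenue"]

print(dict(summary))
```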



## 4. Calculate the total revenue for each day.

In [45]:

daily_revenue = {}

for transaction in sales_data:
    date = transaction["date"]
    revenue = transaction["revenue"]
   
    if date in daily_revenue:
        daily_revenue[date] += revenue
    else:
        daily_revenue[date] = revenue

print("Daily Revenue:")
print(daily_revenue)
print()

Daily Revenue:
{'2023-01-01': 160.0, '2023-01-02': 65.0, '2023-01-03': 120.0, '2023-01-04': 100.0, '2023-01-05': 60.0}



## 5. Find the date with the highest and lowest total revenue.

In [46]:
max_revenue_date = max(daily_revenue, key=daily_revenue.get)
min_revenue_date = min(daily_revenue, key=daily_revenue.get)

print("Date with Highest Total Revenue:", max_revenue_date)
print("Date with Lowest Total Revenue:", min_revenue_date)
print()


Date with Highest Total Revenue: 2023-01-01
Date with Lowest Total Revenue: 2023-01-05
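
`max` and `min` return only one date each. If a full ranking is useful, one option is to sort the `(date, revenue)` pairs. The sketch below copies the daily totals from the output of task 4; the names `ranked_days`, `highest_date`, and `lowest_date` are illustrative.

```python
# Daily totals copied from the output of task 4
daily_revenue = {
    "2023-01-01": 160.0,
    "2023-01-02": 65.0,
    "2023-01-03": 120.0,
    "2023-01-04": 100.0,
    "2023-01-05": 60.0,
}

# Sort (date, revenue) pairs by revenue, highest first
ranked_days = sorted(daily_revenue.items(), key=lambda item: item[1], reverse=True)

highest_date, highest_revenue = ranked_days[0]
lowest_date, lowest_revenue = ranked_days[-1]
print(highest_date, highest_revenue)
print(lowest_date, lowest_revenue)
```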



## 6. Create a new list of dictionaries that contains only the transactions for "Product A".

In [47]:
product_a_transactions = [transaction for transaction in sales_data if transaction["product"] == "Product A"]
print("Transactions for Product A:")
print(product_a_transactions)
print()

Transactions for Product A:
[{'date': '2023-01-01', 'product': 'Product A', 'quantity': 5, 'revenue': 100.0}, {'date': '2023-01-02', 'product': 'Product A', 'quantity': 2, 'revenue': 40.0}, {'date': '2023-01-04', 'product': 'Product A', 'quantity': 3, 'revenue': 60.0}, {'date': '2023-01-05', 'product': 'Product A', 'quantity': 2, 'revenue': 40.0}]



## 7. Calculate the average revenue per transaction for "Product B".

In [48]:
product_b_transactions = [transaction for transaction in sales_data if transaction["product"] == "Product B"]
total_revenue_product_b = sum(transaction["revenue"] for transaction in product_b_transactions)
average_revenue_per_transaction_product_b = total_revenue_product_b / len(product_b_transactions)

print("Average Revenue per Transaction for Product B:", average_revenue_per_transaction_product_b)
print()

Average Revenue per Transaction for Product B: 60.0



## 8. Determine the most sold product (product with the highest total quantity).

In [49]:
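# Compare total quantities per product (from product_summary in task 3), not single transactions
most_sold_product = max(product_summary, key=lambda product: product_summary[product]["total_quantity"])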

print("Most Sold Product:", most_sold_product)
print()

Most Sold Product: Product A
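
The highest total quantity and the largest single transaction are different measures. In the lab data both happen to point to Product A, so the difference is easy to miss. The sketch below uses hypothetical transactions (the product names `Widget` and `Gadget` are invented for illustration) where the two measures disagree:

```python
# Hypothetical transactions: "Widget" has the biggest single sale,
# but "Gadget" sells more in total
transactions = [
    {"product": "Widget", "quantity": 5},
    {"product": "Gadget", "quantity": 3},
    {"product": "Gadget", "quantity": 4},
]

# Largest single transaction
largest_single = max(transactions, key=lambda t: t["quantity"])["product"]

# Highest total quantity per product
totals = {}
for t in transactions:
    totals[t["product"]] = totals.get(t["product"], 0) + t["quantity"]
most_sold = max(totals, key=totals.get)

print(largest_single)  # Widget
print(most_sold)       # Gadget
```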



## 9. Calculate the total revenue for each product.

In [50]:
total_revenue_per_product = {}
for transaction in sales_data:
    product = transaction["product"]
    revenue = transaction["revenue"]
    # Add this transaction's revenue to the product's running total (starting from 0)
    total_revenue_per_product[product] = total_revenue_per_product.get(product, 0) + revenue
print("Total Revenue per Product:")
print(total_revenue_per_product)

Total Revenue per Product:
{'Product A': 240.0, 'Product B': 180.0, 'Product C': 85.0}
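
The lab description mentions creating a CSV file. One way to do that, sketched here with the standard-library `csv` module, is to write a list of dictionaries with `csv.DictWriter` and read it back with `csv.DictReader`. The file name `sales_sample.csv`, the temporary directory, and the two-row sample are assumptions for illustration.

```python
import csv
import os
import tempfile

rows = [
    {"date": "2023-01-01", "product": "Product A", "quantity": 5, "revenue": 100.00},
    {"date": "2023-01-01", "product": "Product B", "quantity": 3, "revenue": 60.00},
]

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sales_sample.csv")

    # newline="" lets the csv module control line endings
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "product", "quantity", "revenue"])
        writer.writeheader()
        writer.writerows(rows)

    # Reading back: every value comes back as a string
    with open(csv_path, newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))

print(loaded[0])
```

Because `csv.DictReader` returns every value as a string, numeric fields need an explicit `int()` or `float()` conversion before they can be summed. `pd.read_csv` infers numeric types on its own.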


In [51]:
# Import necessary libraries
import pandas as pd
import json

# Load the CSV dataset
df = pd.read_csv("lab_2/semi_strut.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'lab_2/semi_strut.csv'

In [None]:
# Inspect the first rows of the data
df.head()

NameError: name 'df' is not defined

In [None]:
# Parse each row's JSON string into a Python dictionary
def json_loads(content):
    content_dict = json.loads(content)
    return content_dict
# Apply the function to the "Content" column and store the result in a new "diction" column
df["diction"] = df["Content"].apply(json_loads)

In [None]:
df

Unnamed: 0,Document ID,Content,Terms,JSON,diction
0,1,"{\n ""title"": ""Introduction to Python"",\n ""...",[],"{'title': 'Introduction to Python', 'author': ...","{'title': 'Introduction to Python', 'author': ..."
1,2,"{\n ""title"": ""Data Analysis with Pandas"",\n ...",[],"{'title': 'Data Analysis with Pandas', 'author...","{'title': 'Data Analysis with Pandas', 'author..."
2,3,"{\n ""title"": ""Web Development with Flask"",\n...",[],"{'title': 'Web Development with Flask', 'autho...","{'title': 'Web Development with Flask', 'autho..."
3,4,"{\n ""title"": ""Machine Learning with Scikit-L...",[],{'title': 'Machine Learning with Scikit-Learn'...,{'title': 'Machine Learning with Scikit-Learn'...
4,5,"{\n ""title"": ""Data Visualization with Matplo...",[],{'title': 'Data Visualization with Matplotlib'...,{'title': 'Data Visualization with Matplotlib'...
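
The `Content` column holds each document as JSON text, and `json.loads` turns that text into a dictionary. The sketch below shows this on a single hypothetical record. The title and keywords follow the first document in the lab output, while the author value is a placeholder.

```python
import json

# Hypothetical record in the style of the "Content" column;
# the author value is a placeholder
content = '{"title": "Introduction to Python", "author": "Jane Doe", "keywords": ["Python", "programming", "beginner"]}'

document = json.loads(content)

print(type(document).__name__)
print(document["title"])
print(document["keywords"])

# Malformed JSON raises json.JSONDecodeError, which can be caught per row
try:
    json.loads("{not valid json}")
except json.JSONDecodeError as error:
    print("Could not parse:", error.msg)
```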


In [None]:
# Tokenization function to extract terms from the JSON-like content
def tokenize_content(content):
    terms = []
    
    # Extract terms from the title only; the other fields are left commented out below
    terms.extend(content.get("title", "").split())
#     terms.extend(content.get("author", "").split())
#     terms.extend(content.get("keywords", []))
    
#     # Extract terms from sections' titles and content
#     sections = content.get("sections", [])
#     for section in sections:
#         terms.extend(section.get("title", "").split())
#         terms.extend(section.get("content", "").split())
    
    return terms

# Tokenize the content and create a new column "Terms"
df["Terms"] = df["diction"].apply(tokenize_content)

# Display the DataFrame with the "Terms" column
df[["Document ID", "Terms"]]

Unnamed: 0,Document ID,Terms
0,1,"[Introduction, to, Python]"
1,2,"[Data, Analysis, with, Pandas]"
2,3,"[Web, Development, with, Flask]"
3,4,"[Machine, Learning, with, Scikit-Learn]"
4,5,"[Data, Visualization, with, Matplotlib]"
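
The commented-out lines in `tokenize_content` hint at collecting terms from more fields. As a sketch of what that could look like, the hypothetical function `tokenize_document` below takes the field names `author`, `keywords`, and `sections` from those comments. The sample document and its values are invented for illustration.

```python
def tokenize_document(content):
    """Collect terms from the title, author, keywords, and sections of one document."""
    terms = []
    terms.extend(content.get("title", "").split())
    terms.extend(content.get("author", "").split())
    # Keywords are already a list, so they are added as they are
    terms.extend(content.get("keywords", []))
    # Each section is a dictionary with its own title and content
    for section in content.get("sections", []):
        terms.extend(section.get("title", "").split())
        terms.extend(section.get("content", "").split())
    return terms

# Hypothetical document; field names follow the commented-out code
sample_document = {
    "title": "Introduction to Python",
    "author": "Jane Doe",
    "keywords": ["Python", "beginner"],
    "sections": [{"title": "Getting Started", "content": "Install Python first"}],
}

print(tokenize_document(sample_document))
print(tokenize_document({}))
```

Using `.get` with a default value means a document that lacks a field contributes no terms instead of raising a `KeyError`.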


In [None]:
# Tokenization function to extract terms from the title only
def tokenize_content_title(content):
    terms_title = []
    
    # Split the title into whitespace-separated terms
    terms_title.extend(content.get("title", "").split())
    
    return terms_title

# Tokenize the titles and create a new column "Terms_title"
df["Terms_title"] = df["diction"].apply(tokenize_content_title)

# Display the DataFrame with the "Terms_title" column
df[["Document ID", "Terms_title"]]

Unnamed: 0,Document ID,Terms_title
0,1,"[Introduction, to, Python]"
1,2,"[Data, Analysis, with, Pandas]"
2,3,"[Web, Development, with, Flask]"
3,4,"[Machine, Learning, with, Scikit-Learn]"
4,5,"[Data, Visualization, with, Matplotlib]"


In [None]:
# Tokenization function to extract terms from the keywords only
def tokenize_content_keyword(content):
    terms_key = []
    
    # Keywords are already stored as a list, so add them as they are
    terms_key.extend(content.get("keywords", []))
    
    return terms_key

# Extract the keywords and create a new column "Terms_keywords"
df["Terms_keywords"] = df["diction"].apply(tokenize_content_keyword)

# Display the DataFrame with the "Terms_keywords" column
df[["Document ID", "Terms_keywords"]]

Unnamed: 0,Document ID,Terms_keywords
0,1,"[Python, programming, beginner]"
1,2,"[Python, Pandas, data analysis]"
2,3,"[Python, Flask, web development]"
3,4,"[Python, machine learning, Scikit-Learn]"
4,5,"[Python, Matplotlib, data visualization]"


In [None]:
df

Unnamed: 0,Document ID,Content,Terms,JSON,diction,Terms_title,Terms_keywords
0,1,"{\n ""title"": ""Introduction to Python"",\n ""...","[Introduction, to, Python]","{'title': 'Introduction to Python', 'author': ...","{'title': 'Introduction to Python', 'author': ...","[Introduction, to, Python]","[Python, programming, beginner]"
1,2,"{\n ""title"": ""Data Analysis with Pandas"",\n ...","[Data, Analysis, with, Pandas]","{'title': 'Data Analysis with Pandas', 'author...","{'title': 'Data Analysis with Pandas', 'author...","[Data, Analysis, with, Pandas]","[Python, Pandas, data analysis]"
2,3,"{\n ""title"": ""Web Development with Flask"",\n...","[Web, Development, with, Flask]","{'title': 'Web Development with Flask', 'autho...","{'title': 'Web Development with Flask', 'autho...","[Web, Development, with, Flask]","[Python, Flask, web development]"
3,4,"{\n ""title"": ""Machine Learning with Scikit-L...","[Machine, Learning, with, Scikit-Learn]",{'title': 'Machine Learning with Scikit-Learn'...,{'title': 'Machine Learning with Scikit-Learn'...,"[Machine, Learning, with, Scikit-Learn]","[Python, machine learning, Scikit-Learn]"
4,5,"{\n ""title"": ""Data Visualization with Matplo...","[Data, Visualization, with, Matplotlib]",{'title': 'Data Visualization with Matplotlib'...,{'title': 'Data Visualization with Matplotlib'...,"[Data, Visualization, with, Matplotlib]","[Python, Matplotlib, data visualization]"


In [None]:
# Join the title terms and keyword terms of each row into one list
df['Concatenated'] = df.apply(lambda row: row['Terms_title'] + row['Terms_keywords'], axis=1)

In [None]:
df

Unnamed: 0,Document ID,Content,Terms,JSON,diction,Terms_title,Terms_keywords,Concatenated
0,1,"{\n ""title"": ""Introduction to Python"",\n ""...","[Introduction, to, Python]","{'title': 'Introduction to Python', 'author': ...","{'title': 'Introduction to Python', 'author': ...","[Introduction, to, Python]","[Python, programming, beginner]","[Introduction, to, Python, Python, programming..."
1,2,"{\n ""title"": ""Data Analysis with Pandas"",\n ...","[Data, Analysis, with, Pandas]","{'title': 'Data Analysis with Pandas', 'author...","{'title': 'Data Analysis with Pandas', 'author...","[Data, Analysis, with, Pandas]","[Python, Pandas, data analysis]","[Data, Analysis, with, Pandas, Python, Pandas,..."
2,3,"{\n ""title"": ""Web Development with Flask"",\n...","[Web, Development, with, Flask]","{'title': 'Web Development with Flask', 'autho...","{'title': 'Web Development with Flask', 'autho...","[Web, Development, with, Flask]","[Python, Flask, web development]","[Web, Development, with, Flask, Python, Flask,..."
3,4,"{\n ""title"": ""Machine Learning with Scikit-L...","[Machine, Learning, with, Scikit-Learn]",{'title': 'Machine Learning with Scikit-Learn'...,{'title': 'Machine Learning with Scikit-Learn'...,"[Machine, Learning, with, Scikit-Learn]","[Python, machine learning, Scikit-Learn]","[Machine, Learning, with, Scikit-Learn, Python..."
4,5,"{\n ""title"": ""Data Visualization with Matplo...","[Data, Visualization, with, Matplotlib]",{'title': 'Data Visualization with Matplotlib'...,{'title': 'Data Visualization with Matplotlib'...,"[Data, Visualization, with, Matplotlib]","[Python, Matplotlib, data visualization]","[Data, Visualization, with, Matplotlib, Python..."
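
The `Concatenated` column joins two lists per row, so a term that appears in both the title and the keywords shows up twice (for example `Python` in the first document). The sketch below reproduces that on plain lists, using the terms of the first document, and shows one optional way to drop repeats while keeping order. The deduplication step is an addition for illustration, not part of the lab.

```python
title_terms = ["Introduction", "to", "Python"]
keyword_terms = ["Python", "programming", "beginner"]

# "+" joins two lists end to end; duplicates are kept
concatenated = title_terms + keyword_terms

# dict.fromkeys keeps the first occurrence of each term, in order
unique_terms = list(dict.fromkeys(concatenated))

print(concatenated)
print(unique_terms)
```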
