Exercise 1: Identifying Data Types

Below are various data sources. Identify whether each one is an example of structured or unstructured data.

A company’s financial reports stored in an Excel file. --> Structured
Photographs uploaded to a social media platform. --> Unstructured
A collection of news articles on a website. --> Unstructured
Inventory data in a relational database. --> Structured
Recorded interviews from a market research study. --> Unstructured

Exercise 2: Transformation Exercise

For each of the following unstructured data sources, propose a method to convert it into structured data. Explain your reasoning.

1. Blog Posts About Travel Experiences
Method:
Use Natural Language Processing (NLP) techniques such as:
    Named Entity Recognition (NER) to extract locations, dates, landmarks, etc.
    Topic modeling (e.g., LDA) to categorize content into themes like "food", "adventure", "culture".
    Sentiment analysis to quantify emotional tone.

Structured Output Example:
Post ID	Location	Date Mentioned	Topic	Sentiment
001	Italy	2023-08	Food	Positive
002	Japan	2024-04	Adventure	Neutral

Reasoning:
Travel blogs typically follow a narrative, but they contain identifiable, extractable entities and events useful for tourism analytics, recommendations, or itinerary planning.
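
The extraction step above can be sketched with the standard library alone. This is a deliberately simplified stand-in: a lookup table of place names plays the role of NER, and keyword counts play the role of topic modeling. The tables `KNOWN_LOCATIONS` and `TOPIC_KEYWORDS`, the sample posts, and the function name `extract_record` are all hypothetical, and sentiment analysis is left out.

```python
import re

# Hypothetical lookup tables; a real pipeline would use an NER model and a topic model.
KNOWN_LOCATIONS = {"Italy", "Japan"}
TOPIC_KEYWORDS = {
    "Food": {"pasta", "gelato", "restaurant", "sushi"},
    "Adventure": {"hike", "hiking", "climb", "trek"},
    "Culture": {"museum", "temple", "festival"},
}
MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}
DATE_PATTERN = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b")


def extract_record(post_id, text):
    """Turn one free-text post into a flat record (one table row)."""
    words = set(re.findall(r"[a-z]+", text.lower()))

    # "NER": first known location that appears in the text.
    location = next((loc for loc in sorted(KNOWN_LOCATIONS) if loc in text), None)

    # Date: first "Month YYYY" mention, normalized to YYYY-MM.
    match = DATE_PATTERN.search(text)
    date = f"{match.group(2)}-{MONTHS[match.group(1)]:02d}" if match else None

    # "Topic model": the topic with the most keyword hits, if any.
    scores = {topic: len(words & keywords) for topic, keywords in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    topic = best if scores[best] > 0 else None

    return {"post_id": post_id, "location": location,
            "date_mentioned": date, "topic": topic}


# Made-up sample posts for illustration only.
posts = {
    "001": "In August 2023 we ate our way through Italy: fresh pasta, gelato, "
           "and a tiny restaurant in Rome.",
    "002": "Japan in April 2024 was all about hiking; we did a three-day trek near Nagano.",
}
records = [extract_record(post_id, text) for post_id, text in posts.items()]
for record in records:
    print(record)
```

Each dictionary corresponds to one row of the structured output table, so the list can be passed directly to `pd.DataFrame(records)`.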


2. Audio Recordings of Customer Service Calls
Method:
Speech-to-Text (ASR) using tools like Whisper or Google Speech API.
Apply NLP to the transcript:
    Extract customer intent.
    Identify key issues or product references.
    Detect sentiment and emotional tone.

Structured Output Example:
Call ID	Customer Intent	Issue Type	Sentiment	Duration (min)
1001	Refund request	Billing	Negative	5.2
1002	Tech support	Connectivity	Neutral	12.5

Reasoning:
Audio is unstructured, but once transcribed, you can apply well-established NLP pipelines to analyze service quality, customer satisfaction, and agent performance.
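
As a rough illustration of the post-transcription step, the sketch below assumes the speech-to-text stage has already produced a transcript string and a call duration in seconds. Intent and sentiment are assigned with simple keyword rules rather than a trained model; the keyword lists, the `CallRecord` class, and the sample transcripts are hypothetical, and the issue-type column is omitted for brevity.

```python
import re
from dataclasses import dataclass, asdict

# Hypothetical keyword rules; a production system would use trained classifiers.
INTENT_KEYWORDS = {
    "Refund request": ("refund", "money back", "charged twice"),
    "Tech support": ("not working", "disconnect", "error", "no signal"),
}
POSITIVE_WORDS = {"thanks", "great", "perfect", "happy"}
NEGATIVE_WORDS = {"angry", "unacceptable", "terrible", "frustrated"}


@dataclass
class CallRecord:
    call_id: int
    customer_intent: str
    sentiment: str
    duration_min: float


def classify_intent(transcript):
    """Return the first intent whose key phrases appear in the transcript."""
    text = transcript.lower()
    for intent, phrases in INTENT_KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return "Unknown"


def score_sentiment(transcript):
    """Count positive minus negative words and map the score to a label."""
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


# Made-up transcripts standing in for ASR output: (call ID, text, duration in seconds).
calls = [
    (1001, "I was charged twice this month and I want a refund. This is unacceptable.", 312),
    (1002, "My router keeps showing an error and the connection will not stay up.", 750),
]
records = [
    CallRecord(call_id, classify_intent(text), score_sentiment(text), round(seconds / 60, 1))
    for call_id, text, seconds in calls
]
for record in records:
    print(asdict(record))
```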


3. Handwritten Notes from a Brainstorming Session
Method:
OCR (Optical Character Recognition) using tools like Tesseract or Google Vision API.
Post-process with text cleaning and clustering to organize ideas (e.g., using keywords, bullet structures).
Optionally, use topic modeling or keyword extraction to group related concepts.

Structured Output Example:
Idea ID	Topic	Description	Priority
01	Marketing	“Social media engagement”	Medium
02	Product Dev	“Voice control integration”	High

Reasoning:
Handwritten notes often use bullets, headings, or sketches. After OCR, these cues can be interpreted into structured formats like idea logs or product requirement tables.
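
The post-processing step might look like the following sketch. It assumes OCR has already produced plain text in which headings sit on their own line and ideas start with a bullet character; the sample text and the `parse_notes` function are hypothetical. Priority is not recoverable from the text alone, so it is left out.

```python
import re

# A bullet line starts with -, * or • followed by whitespace.
BULLET = re.compile(r"^[-*•]\s+(.*)$")


def parse_notes(ocr_text):
    """Group bullet lines under the most recent heading line."""
    ideas = []
    topic = None
    for raw_line in ocr_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        match = BULLET.match(line)
        if match:
            ideas.append({
                "idea_id": f"{len(ideas) + 1:02d}",
                "topic": topic,
                "description": match.group(1).strip(),
            })
        else:
            topic = line  # any non-bullet line is treated as a heading
    return ideas


# Made-up OCR output for illustration only.
ocr_text = """
Marketing
- Social media engagement
- Referral program

Product Dev
* Voice control integration
"""
ideas = parse_notes(ocr_text)
for idea in ideas:
    print(idea)
```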

4. A Video Tutorial on Cooking
Method:
Use video-to-text transcription (e.g., speech recognition for narration).
Combine with computer vision to detect objects (ingredients, tools) and actions (chop, boil).
Build a step-by-step recipe format.

Structured Output Example:
Step	Action	Ingredient	Duration (min)	Tool
1	Chop	Onions	2	Knife
2	Saute	Onions, garlic	5	Frying pan
3	Boil	Pasta	10	Saucepan

Reasoning:
Cooking videos are rich in visual and audio cues that map well to structured instruction sets. Converting them into this format yields usable recipe databases or training data for AI cooking assistants.
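
To make the last step concrete, here is a minimal sketch that builds recipe steps from a narration transcript. It assumes the transcript already exists and that every instruction follows the pattern "<action> the <ingredients> for <N> minutes", which real narration rarely does. The transcript and the `parse_steps` function are hypothetical; the tool column would come from the computer-vision stage and is not covered.

```python
import re

# Matches sentences such as "Chop the onions for 2 minutes".
STEP_PATTERN = re.compile(r"(\w+) the ([\w ,]+?) for (\d+) minutes?")


def parse_steps(transcript):
    """Extract numbered (action, ingredient, duration) steps from narration."""
    steps = []
    for number, (action, ingredient, minutes) in enumerate(
        STEP_PATTERN.findall(transcript), start=1
    ):
        steps.append({
            "step": number,
            "action": action,
            "ingredient": ingredient,
            "duration_min": int(minutes),
        })
    return steps


# Made-up narration for illustration only.
transcript = (
    "Chop the onions for 2 minutes. "
    "Saute the onions and garlic for 5 minutes. "
    "Boil the pasta for 10 minutes."
)
steps = parse_steps(transcript)
for step in steps:
    print(step)
```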

Exercise 3: Import A File From Kaggle

Import the train dataset. Use the train.csv file.
Print the first few rows of the DataFrame.

In [2]:
import pandas as pd

df = pd.read_csv("train.csv", header=0, index_col='PassengerId')
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
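
To see what `index_col` does without downloading anything, the sketch below reads a small in-memory CSV built from three of the rows shown above (with fewer columns). The use of `io.StringIO` in place of a file is an assumption made for illustration; the call to `pd.read_csv` is otherwise the same.

```python
import io

import pandas as pd

# A few rows copied from the output above, kept in memory instead of a file.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0\n'
    '5,0,3,"Allen, Mr. William Henry",male,35.0\n'
)

# index_col moves PassengerId out of the columns and uses it as the row labels.
df = pd.read_csv(sample_csv, index_col="PassengerId")

print(df.index.name)       # name of the row index
print(df.shape)            # (rows, columns) excluding the index
print(df.loc[3, "Name"])   # label-based lookup by passenger ID
```

Because `PassengerId` is now the index, `df.loc[3]` selects the passenger whose ID is 3 rather than the row at position 3.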


Exercise 4: Importing A CSV File

Use the Iris Dataset CSV.

Download the Iris dataset CSV file and place it in the same directory as your Jupyter Notebook.
Import the CSV file using Pandas.
Display the first five rows of the dataset.

In [5]:
import pandas as pd

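# The file has no header row, so supply column names instead of letting
# pandas treat the first record as the header.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = pd.read_csv("Iris_dataset.csv", header=None, names=columns)
df.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa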


Exercise 5: Export A DataFrame To Excel Format And JSON Format

Create a simple dataframe.
Export the dataframe to an Excel file.
Export the dataframe to a JSON file.

In [6]:
import pandas as pd

data = {
    'Book Title': ['The Great Gatsby', 'To Kill a Mockingbird', '1984', 'Pride and Prejudice', 'The Catcher in the Rye'],
    'Author': ['F. Scott Fitzgerald', 'Harper Lee', 'George Orwell', 'Jane Austen', 'J.D. Salinger'],
    'Genre': ['Classic', 'Classic', 'Dystopian', 'Classic', 'Classic'],
}

# Build the DataFrame from the dictionary
df = pd.DataFrame(data)

# Export as an Excel file
df.to_excel('simple_export.xlsx', sheet_name='Sheet1')

# Export as a JSON file
df.to_json('simple_export.json')
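
The layout of the exported JSON depends on the `orient` argument of `to_json`. The sketch below compares the default layout with `orient="records"` and reads the result back. It uses a shortened copy of the book data and keeps everything in memory (calling `to_json()` without a path returns a string), which is an assumption made so the example runs without writing files.

```python
import io
import json

import pandas as pd

# Shortened copy of the book data, for illustration only.
df = pd.DataFrame({
    "Book Title": ["The Great Gatsby", "Pride and Prejudice"],
    "Author": ["F. Scott Fitzgerald", "Jane Austen"],
})

# Default layout: one object per column, keyed by row label.
default_json = df.to_json()
print(json.loads(default_json))

# orient="records": a list with one object per row.
records_json = df.to_json(orient="records")
print(json.loads(records_json))

# Round trip: read the records layout back into a DataFrame.
restored = pd.read_json(io.StringIO(records_json), orient="records")
print(restored)
```

The records layout is often easier for other programs to consume, while the default layout preserves the row labels.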

Exercise 6: Reading JSON Data

Use a sample JSON dataset

Import the JSON data from the provided URL.
Use Pandas to read the JSON data.
Display the first five entries of the data.

In [14]:
import pandas as pd

df = pd.read_json("posts.json")
df.head()

,userId,id,title,body
0,1,1,sunt aut facere repellat provident occaecati e...,quia et suscipit\nsuscipit recusandae consequu...
1,1,2,qui est esse,est rerum tempore vitae\nsequi sint nihil repr...
2,1,3,ea molestias quasi exercitationem repellat qui...,et iusto sed quo iure\nvoluptatem occaecati om...
3,1,4,eum et est occaecati,ullam et saepe reiciendis voluptatem adipisci\...
4,1,5,nesciunt quas odio,repudiandae veniam quaerat sunt sed\nalias aut...
