## Exercise 1: Identifying Data Types

Below is the classification of various data sources as structured or unstructured data:

1. **A company’s financial reports stored in an Excel file**  — **Structured Data**  
   *Reason: The data is organized in a tabular format with predefined columns and rows.*

2. **Photographs uploaded to a social media platform**  — **Unstructured Data**  
   *Reason: Image files do not follow a formal data model and are not stored in a tabular format.*

3. **A collection of news articles on a website**  — **Unstructured Data**  
   *Reason: Text data from articles is not structured into columns and rows.*

4. **Inventory data in a relational database**  — **Structured Data**  
   *Reason: Stored in tables with clear schema definitions and data types.*

5. **Recorded interviews from a market research study**  — **Unstructured Data**  
   *Reason: Audio files lack a structured schema and must be transcribed before structured analysis is possible.*


## Exercise 2: Transformation Exercise

Below are methods for converting unstructured data into structured formats, along with reasoning and potential applications.

---

### 1. A Series of Blog Posts About Travel Experiences

**Purpose of Structuring:**  
To extract and retrieve practical information for travel planning and content analysis.

**Proposed Method:**  
- Assign metadata to each post:
  - Location(s) mentioned
  - Author
  - Date of publication
  - Tags/topics (e.g., food, culture, transport)
- Use tags instead of fixed columns to accommodate multiple themes per post.
- Store structured data in a format like JSON or in a database with a many-to-many relationship between posts and tags.

**Example JSON Structure:**
```json
{
  "title": "My Journey to Kyoto",
  "author": "alex_traveler",
  "date": "2024-07-18",
  "location": "Japan, Kyoto",
  "tags": ["temples", "food", "urban culture"],
  "url": "https://example.com/blog/kyoto-trip"
}
```
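
The many-to-many option mentioned above could be sketched with Python's built-in `sqlite3` module. The table names (`posts`, `tags`, `post_tags`), the helper functions `add_post` and `titles_with_tag`, and the second sample post are assumptions made for illustration only; they are not part of the exercise.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author TEXT,
        published TEXT,
        location TEXT,
        url TEXT
    );
    CREATE TABLE tags (
        id INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
    -- Junction table: one row per (post, tag) pair.
    CREATE TABLE post_tags (
        post_id INTEGER REFERENCES posts(id),
        tag_id INTEGER REFERENCES tags(id),
        PRIMARY KEY (post_id, tag_id)
    );
""")


def add_post(conn, post):
    """Insert one post dict (shaped like the JSON example) and link its tags."""
    cur = conn.execute(
        "INSERT INTO posts (title, author, published, location, url) "
        "VALUES (?, ?, ?, ?, ?)",
        (post["title"], post["author"], post["date"], post["location"], post["url"]),
    )
    post_id = cur.lastrowid
    for name in post["tags"]:
        # Reuse the tag row if it already exists.
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT id FROM tags WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO post_tags (post_id, tag_id) VALUES (?, ?)",
            (post_id, tag_id),
        )
    return post_id


def titles_with_tag(conn, tag):
    """Return the titles of all posts carrying the given tag, sorted by title."""
    rows = conn.execute(
        "SELECT p.title FROM posts p "
        "JOIN post_tags pt ON pt.post_id = p.id "
        "JOIN tags t ON t.id = pt.tag_id "
        "WHERE t.name = ? ORDER BY p.title",
        (tag,),
    ).fetchall()
    return [row[0] for row in rows]


add_post(conn, {
    "title": "My Journey to Kyoto",
    "author": "alex_traveler",
    "date": "2024-07-18",
    "location": "Japan, Kyoto",
    "tags": ["temples", "food", "urban culture"],
    "url": "https://example.com/blog/kyoto-trip",
})
# Second post is invented purely to show a shared tag.
add_post(conn, {
    "title": "Street Food in Osaka",
    "author": "alex_traveler",
    "date": "2024-07-21",
    "location": "Japan, Osaka",
    "tags": ["food", "markets"],
    "url": "https://example.com/blog/osaka-food",
})

print(titles_with_tag(conn, "food"))
```

Because tags live in their own table, a new theme can be added without changing the `posts` schema, which is the main reason to prefer tags over fixed columns here.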

### 2. Audio Recordings of Customer Service Calls

**Purpose of Structuring:**  
To analyze customer interactions for quality improvement, training, and customer satisfaction metrics.

**Proposed Method:**

1. **Transcription**  
   Convert audio recordings to text using speech recognition tools such as:
   - Whisper (OpenAI)
   - Google Speech-to-Text
   - Amazon Transcribe

2. **Usefulness of Structuring**  
   The transcription allows:
   - Searching for specific phrases or issues.
   - Labeling examples of effective and ineffective communication.
   - Identifying key conversation phases:
     - Understanding customer needs
     - Offering relevant solutions
     - Closing the sale
     - Creating a positive customer experience
     - Asking for feedback or referrals

3. **Structuring Key Elements**  
   - Tag calls by quality of service: “good”, “needs improvement”
   - Extract metadata:
     - Date and time
     - Duration
     - Agent name/ID
     - Customer name/ID (if available)
     - Topics discussed
     - Emotional tone
   - Segment conversation phases:
     - Greeting
     - Problem description
     - Solution offering
     - Closing
     - Follow-up

4. **Advanced Features (optional)**  
   - Compare conversations with predefined scripts and flag deviations.
   - Detect tone, politeness, and emotion using NLP tools.
   - Identify recurring pain points or service gaps.

**Example JSON Structure:**

```json
{
  "call_id": 1243,
  "date": "2024-12-03",
  "duration": "00:05:47",
  "agent": "Anna S.",
  "customer": "Ivan P.",
  "topic": ["billing", "refund"],
  "sentiment": "negative",
  "transcript": "Hello, I haven't received the refund for my purchase...",
  "audio_url": "https://example.com/calls/1243.mp3"
}
```
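
Once a transcript exists, part of such a record could be filled in automatically. The sketch below uses simple keyword matching to tag topics; the keyword lists, the function names, and the sample transcript are assumptions for illustration, and a production system would more likely rely on a trained classifier or an NLP service.

```python
import re

# Hypothetical keyword rules; the topics and keywords are illustrative only.
TOPIC_KEYWORDS = {
    "billing": ["invoice", "charge", "charged", "bill"],
    "refund": ["refund", "money back"],
    "delivery": ["delivery", "shipping", "courier"],
}


def tag_topics(transcript):
    """Return the sorted list of topics whose keywords occur in the transcript."""
    text = transcript.lower()
    found = []
    for topic, keywords in TOPIC_KEYWORDS.items():
        # Word boundaries prevent partial matches inside longer words.
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
            found.append(topic)
    return sorted(found)


def format_duration(seconds):
    """Convert a duration in seconds to an HH:MM:SS string."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_call_record(call_id, date, seconds, agent, transcript):
    """Assemble a structured record from call metadata and its transcript."""
    return {
        "call_id": call_id,
        "date": date,
        "duration": format_duration(seconds),
        "agent": agent,
        "topic": tag_topics(transcript),
        "transcript": transcript,
    }


# Sample transcript invented for this sketch.
record = build_call_record(
    1243,
    "2024-12-03",
    347,
    "Anna S.",
    "Hello, I was charged twice and I have not received my refund.",
)
print(record["topic"], record["duration"])
```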

### 3. Handwritten Notes from a Brainstorming Session

**Purpose of Structuring:**  
To organize raw creative input from a brainstorming session into actionable, searchable, and evaluable information.

**Proposed Method:**

1. **Digitization**  
   Convert handwritten notes into digital text using:
   - Optical Character Recognition (OCR)
   - Manual transcription

2. **Categorization by Purpose**  
   - Group ideas based on the goals they aim to achieve.
   - If session goals are unknown, use thematic clustering to identify major idea groups (e.g., product, marketing, technical).

3. **Author Attribution**  
   - If available, tag ideas with the name or initials of the person who proposed them.

4. **Status Tracking (if applicable)**  
   - Mark which ideas were later considered useful, feasible, or were rejected.
   - If no post-analysis was done, store all ideas as-is for future processing.

5. **Advanced Structuring (optional)**  
   - Link ideas to potential projects or objectives.
   - Use scoring systems to evaluate:
     - Feasibility (1–5)
     - Potential impact (1–5)
   - Create categories like:
     - Accepted
     - Under review
     - Deferred
     - Rejected

**Example JSON Structure:**
```json
{
  "idea": "Integrate referral program into onboarding flow",
  "author": "Elena",
  "goal": "Increase user acquisition",
  "status": "Accepted",
  "feasibility": 4,
  "impact": 5
}
```
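
With ideas stored in this shape, the feasibility and impact scores can drive a simple ranking. The sketch below assumes a priority rule of feasibility multiplied by impact, which is only one possible choice; the `Idea` class, the `rank_open_ideas` helper, and the two extra ideas are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    idea: str
    author: str
    goal: str
    status: str
    feasibility: int  # 1-5
    impact: int       # 1-5

    def __post_init__(self):
        # Reject scores outside the 1-5 scale used in the notes above.
        for name in ("feasibility", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def priority(self):
        # Assumed scoring rule: the product of the two scores.
        return self.feasibility * self.impact


def rank_open_ideas(ideas):
    """Return all non-rejected ideas, highest priority first."""
    open_ideas = [i for i in ideas if i.status != "Rejected"]
    return sorted(open_ideas, key=lambda i: i.priority, reverse=True)


ideas = [
    # Invented entries, used only to give the ranking something to sort.
    Idea("Publish a weekly product newsletter", "Marc",
         "Improve retention", "Under review", 5, 2),
    Idea("Rebuild the mobile app from scratch", "Sofia",
         "Improve performance", "Rejected", 2, 5),
    # Entry taken from the JSON example above.
    Idea("Integrate referral program into onboarding flow", "Elena",
         "Increase user acquisition", "Accepted", 4, 5),
]

for item in rank_open_ideas(ideas):
    print(item.priority, item.status, item.idea)
```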

### 4. A Video Tutorial on Cooking

**Purpose of Structuring:**  
To make the video easier to navigate, more informative, and easier to use for learning or meal preparation.

**Proposed Method:**

1. **Transcription**  
   - Convert spoken content into text using automatic transcription tools.
   - Enables text-based search and step extraction.

2. **Timecodes for Video Chapters**  
   - Add timestamps to mark different stages of the recipe:
     - Introduction
     - Ingredients
     - Preparation steps
     - Cooking
     - Serving
   - Allows quick navigation to any part of the tutorial.

3. **Metadata Extraction**  
   - Title of the recipe
   - Short description of the dish
   - Author/creator
   - Duration
   - Cuisine type, difficulty level, preparation time

4. **Structured Lists**  
   - **Ingredients** (quantities included)
   - **Required tools/equipment** (e.g., blender, oven, specific utensils)

5. **Optional Enhancements**  
   - **Tagging**: classify by cuisine, meal type (e.g., breakfast, dessert), diet (vegan, keto).
   - **Step extraction using NLP**:
     - Parse the transcript to extract actionable steps:
       - “Chop the onions”
       - “Boil water”
       - “Mix all ingredients”  
     - Useful for generating summaries or step-by-step guides automatically.

**Example JSON Structure:**
```json
{
  "title": "How to Make Vegan Lasagna",
  "author": "Chef Elena",
  "duration": "00:12:45",
  "description": "A step-by-step tutorial for a plant-based lasagna recipe.",
  "ingredients": [
    "1 zucchini",
    "200g tomato sauce",
    "250g tofu"
  ],
  "tools": ["oven", "baking dish", "knife"],
  "steps": [
    {"time": "00:30", "instruction": "Chop the zucchini"},
    {"time": "02:15", "instruction": "Preheat the oven"},
    {"time": "03:40", "instruction": "Mix tofu with seasoning"}
  ],
  "tags": ["vegan", "lasagna", "italian"],
  "video_url": "https://example.com/videos/vegan-lasagna"
}
```
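
The `steps` list in such a record could be derived from a timestamped transcript. The sketch below is a deliberately naive heuristic: it keeps only segments that begin with a verb from a small hand-written list. The input format (pairs of start time in seconds and text), the verb list, and the function names are assumptions; real step extraction would need a proper NLP pipeline.

```python
# Assumed input: (start_seconds, text) pairs, one per transcript segment.
# The verb list is a small illustrative sample, not a complete vocabulary.
ACTION_VERBS = {"chop", "boil", "mix", "preheat", "add", "bake", "stir", "slice"}


def to_timecode(seconds):
    """Convert seconds to an MM:SS string."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def extract_steps(segments):
    """Keep segments whose first word is a known cooking verb."""
    steps = []
    for start, text in segments:
        cleaned = text.strip()
        words = cleaned.split()
        if words and words[0].lower().strip(".,!") in ACTION_VERBS:
            steps.append({
                "time": to_timecode(start),
                "instruction": cleaned.rstrip("."),
            })
    return steps


# Sample segments invented to mirror the JSON example above.
segments = [
    (5, "Hi everyone, welcome back to the channel."),
    (30, "Chop the zucchini."),
    (135, "Preheat the oven."),
    (220, "Mix tofu with seasoning."),
]

steps = extract_steps(segments)
for step in steps:
    print(step["time"], step["instruction"])
```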


## Exercise 3: Importing a File from Kaggle

In [26]:
import pandas as pd

In [27]:
df_train = pd.read_csv("train.csv")

In [28]:
display(df_train.head(6))
display(df_train.tail(4))

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


## Exercise 4: Importing a CSV File

In [29]:
df_iris = pd.read_csv("Iris.csv")

In [30]:
display(df_iris.head())

|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


## Exercise 5: Exporting a DataFrame to Excel and JSON

In [31]:
data_5 = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "London"},
    {"Name": "Charlie", "Age": 35, "City": "Paris"}
]

df_5 = pd.DataFrame(data_5)


In [32]:
df_5.to_excel("data_5.xlsx", index=False)

In [33]:
df_5.to_json("data_5.json", orient="records", indent=4)
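
A quick way to check a JSON export is to read it back and compare it with the source data. The sketch below repeats the same records and round-trips them through an in-memory string instead of a file, so that it runs on its own; the variable names `records`, `json_text`, and `restored` are illustrative. It assumes that `to_json` returns a string when no path is given.

```python
from io import StringIO

import pandas as pd

records = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "London"},
    {"Name": "Charlie", "Age": 35, "City": "Paris"},
]
df = pd.DataFrame(records)

# With no path argument, to_json returns the JSON text instead of writing a file.
json_text = df.to_json(orient="records", indent=4)

# Wrap the text in StringIO so read_json treats it as a file-like object.
restored = pd.read_json(StringIO(json_text), orient="records")

# Compare as plain Python records to sidestep dtype differences between versions.
print(restored.to_dict(orient="records") == records)
```

The same comparison should apply to a file written with `df_5.to_json("data_5.json", orient="records", indent=4)` by passing the file path to `pd.read_json` instead of the `StringIO` object.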

## Exercise 6: Reading JSON Data

In [34]:
df_post = pd.read_json("posts.json")

In [35]:
display(df_post.head())

|   | userId | id | title | body |
|---|---|---|---|---|
| 0 | 1 | 1 | sunt aut facere repellat provident occaecati e... | quia et suscipit\nsuscipit recusandae consequu... |
| 1 | 1 | 2 | qui est esse | est rerum tempore vitae\nsequi sint nihil repr... |
| 2 | 1 | 3 | ea molestias quasi exercitationem repellat qui... | et iusto sed quo iure\nvoluptatem occaecati om... |
| 3 | 1 | 4 | eum et est occaecati | ullam et saepe reiciendis voluptatem adipisci\... |
| 4 | 1 | 5 | nesciunt quas odio | repudiandae veniam quaerat sunt sed\nalias aut... |
