## Exercises XP: W2_D2

### Exercise 1: Identifying Data Types

#### Below are various data sources. Identify whether each one is an example of structured or unstructured data.

| Data Source                                 | Type         |
|---------------------------------------------|--------------|
| Excel financial report                      | Structured   |
| Social media photos                         | Unstructured |
| News articles                               | Unstructured |
| Inventory in a relational database          | Structured   |
| Market research recorded interviews         | Unstructured |


### Exercise 2: Transformation Exercise

#### For each unstructured source below, I suggest a possible method for converting it into structured data and explain why.

| Unstructured Data Source                       | Method to Convert to Structured Data                                     | Reasoning                                                           |
| ---------------------------------------------- | ------------------------------------------------------------------------ | ------------------------------------------------------------------- |
| Blog posts about travel experiences            | Use NLP techniques to extract keywords, topics, sentiment into a table   | Text analysis can turn paragraphs into structured categories        |
| Audio recordings of customer service calls     | Apply speech-to-text, then analyze transcripts                           | Transcripts can be parsed into conversation topics, durations, etc. |
| Handwritten notes from a brainstorming session | Use OCR (Optical Character Recognition) to extract text                  | Once digitized, text can be organized by idea, category, or author  |
| A video tutorial on cooking                    | Extract audio and apply speech-to-text, then tag video segments manually | Allows conversion into step-by-step structured instructions         |
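
The first row of the table (text to keywords in a table) can be sketched with the standard library alone. This is a minimal illustration, not a real NLP pipeline: the two sample posts, the stop-word list, and the helper name `post_to_row` are all hypothetical and invented for this example.

```python
import re
from collections import Counter

# Hypothetical sample posts, invented purely for illustration.
posts = [
    "The beach in Lisbon was amazing and the food was amazing too.",
    "Our train to Prague was late, but the old town was beautiful.",
]

# A tiny hand-picked stop-word list (an assumption; real pipelines use larger lists).
STOP_WORDS = {"the", "in", "was", "and", "too", "our", "to", "but", "a", "of"}


def post_to_row(post_id, text):
    """Turn one free-text post into a fixed set of fields (one table row)."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    # Guard against posts that contain only stop words.
    top_keyword, top_count = Counter(words).most_common(1)[0] if words else (None, 0)
    return {
        "post_id": post_id,
        "word_count": len(words),
        "top_keyword": top_keyword,
        "top_count": top_count,
    }


# Every post now has the same columns, which is what makes the result "structured".
rows = [post_to_row(i, text) for i, text in enumerate(posts, start=1)]
for row in rows:
    print(row)
```

A list of dictionaries with identical keys like `rows` can be passed directly to `pd.DataFrame(rows)` to get a table.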


### Exercise 3: Import a file from Kaggle

In [18]:
import pandas as pd  # Load the pandas library

In [19]:
import os  # Needed for the os.path.exists checks in the cells below
import zipfile

import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

In [20]:
# 1. Authenticate using the kaggle.json file already placed in C:\Users\julia\.kaggle\
api = KaggleApi()
api.authenticate()
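
Before authenticating, it can help to confirm that the credentials file is where the client is expected to look. The sketch below assumes the usual location, a `.kaggle` folder in the user's home directory, which matches the path mentioned in the comment above. The helper name `kaggle_credentials_path` is hypothetical and not part of the Kaggle package.

```python
from pathlib import Path


def kaggle_credentials_path(home=None):
    """Return the expected location of kaggle.json under a home directory."""
    home = Path(home) if home is not None else Path.home()
    return home / ".kaggle" / "kaggle.json"


# Check the current user's home directory without touching the Kaggle API.
cred_path = kaggle_credentials_path()
if cred_path.is_file():
    print(f"Found credentials at {cred_path}")
else:
    print(f"No kaggle.json at {cred_path}; authentication would likely fail.")
```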

In [21]:
# 2. Download the dataset ZIP into the current folder
api.dataset_download_files('hesh97/titanicdataset-traincsv', path='.', unzip=False)

Dataset URL: https://www.kaggle.com/datasets/hesh97/titanicdataset-traincsv


In [22]:
# 3. Unzip the downloaded archive
zip_path = 'titanicdataset-traincsv.zip'
if os.path.exists(zip_path):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall('.')  # Extract the files into the current folder
    print("The ZIP file was extracted successfully.")

The ZIP file was extracted successfully.
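
When an archive contains more than the CSV of interest, `extractall` also accepts a `members` list. The sketch below shows one way to extract only the `.csv` entries. The helper name `extract_csv_members` and the throwaway demo archive are hypothetical; the demo runs in a temporary folder so it does not touch the Titanic download.

```python
import os
import tempfile
import zipfile


def extract_csv_members(zip_path, dest_dir="."):
    """Extract only the .csv members of a ZIP archive and return their names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        csv_names = [n for n in zip_ref.namelist() if n.lower().endswith(".csv")]
        zip_ref.extractall(dest_dir, members=csv_names)
    return csv_names


# Self-contained demo on a throwaway archive (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("train.csv", "a,b\n1,2\n")
        zf.writestr("README.txt", "not a csv")

    extracted = extract_csv_members(demo_zip, dest_dir=tmp)
    csv_exists = os.path.exists(os.path.join(tmp, "train.csv"))
    readme_exists = os.path.exists(os.path.join(tmp, "README.txt"))

print(extracted)
```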


In [23]:
# 4. Load the CSV into a pandas DataFrame
csv_path = 'train.csv'
if os.path.exists(csv_path):
    df = pd.read_csv(csv_path)
    print("✅ Data loaded from train.csv:")
    print(df.head())
else:
    print("❌ train.csv not found.")

✅ Data loaded from train.csv:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0         
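
The `NaN` entries in the `Cabin` column above are missing values. A quick way to count them per column is `isna().sum()`. The sketch below rebuilds only the four complete rows and three of the columns from the printed output, so it is a small stand-in and not the full `train.csv`.

```python
import io

import pandas as pd

# Three columns from the first four printed rows; empty fields become NaN.
sample_csv = io.StringIO(
    "PassengerId,Cabin,Embarked\n"
    "1,,S\n"
    "2,C85,C\n"
    "3,,S\n"
    "4,C123,S\n"
)
sample_df = pd.read_csv(sample_csv)

# Count missing values per column.
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

On the real DataFrame the same call would be `df.isna().sum()`.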

### Exercise 4: Importing a CSV File

In [24]:
import pandas as pd  # Import pandas library

# Read the iris.csv file (make sure it's in the same folder as your notebook/script)
df = pd.read_csv("iris.csv")

# Display the first five rows of the dataset
print(df.head())

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
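
If `iris.csv` is not in the working folder, `pd.read_csv` raises `FileNotFoundError`. A small guard, in the same spirit as the existence check used in Exercise 3, could look like the sketch below. The helper name `load_csv_or_none` is hypothetical, and the demo file holds only a few values copied from the output above.

```python
import os
import tempfile

import pandas as pd


def load_csv_or_none(csv_path):
    """Return a DataFrame, or None with a message if the file is missing."""
    if not os.path.exists(csv_path):
        print(f"{csv_path} not found.")
        return None
    return pd.read_csv(csv_path)


# Demo in a temporary folder so no real file is required.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "iris_demo.csv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Id,SepalLengthCm,Species\n1,5.1,Iris-setosa\n2,4.9,Iris-setosa\n")

    demo_df = load_csv_or_none(demo_path)
    missing_df = load_csv_or_none(os.path.join(tmp, "missing.csv"))

print(demo_df.shape)
```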


### Exercise 5: Export a DataFrame to Excel and JSON formats

In [26]:
import pandas as pd  # Import pandas

# Create a simple dataframe
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['Paris', 'London', 'Berlin']
}
df = pd.DataFrame(data)

# Export the dataframe to an Excel file
df.to_excel("people.xlsx", index=False)  # index=False to avoid writing row numbers

# Export the dataframe to a JSON file
df.to_json("people.json", orient="records", indent=4)  # formatted JSON output

print(df)

      Name  Age    City
0    Alice   25   Paris
1      Bob   30  London
2  Charlie   35  Berlin
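
To see what `orient="records"` produces, `to_json` can be called without a file path, in which case it returns the JSON as a string. The sketch below parses that string back with the standard `json` module as a simple round-trip check. The variable names are illustrative.

```python
import json

import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["Paris", "London", "Berlin"],
})

# Without a path, to_json returns the JSON text instead of writing a file.
json_text = people.to_json(orient="records")

# orient="records" yields a list with one object per row.
records = json.loads(json_text)
print(records[0])

# Rebuild a DataFrame from the parsed records.
restored = pd.DataFrame(records)
```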


### Exercise 6: Reading JSON Data

In [27]:
import pandas as pd  # Import pandas

# Load JSON data from a URL
url = "https://jsonplaceholder.typicode.com/users"

# Read the JSON data into a DataFrame
df = pd.read_json(url)

# Display the first five entries
print(df.head())

   id              name   username                      email  \
0   1     Leanne Graham       Bret          Sincere@april.biz   
1   2      Ervin Howell  Antonette          Shanna@melissa.tv   
2   3  Clementine Bauch   Samantha         Nathan@yesenia.net   
3   4  Patricia Lebsack   Karianne  Julianne.OConner@kory.org   
4   5  Chelsey Dietrich     Kamren   Lucio_Hettinger@annie.ca   

                                             address                  phone  \
0  {'street': 'Kulas Light', 'suite': 'Apt. 556',...  1-770-736-8031 x56442   
1  {'street': 'Victor Plains', 'suite': 'Suite 87...    010-692-6593 x09125   
2  {'street': 'Douglas Extension', 'suite': 'Suit...         1-463-123-4447   
3  {'street': 'Hoeger Mall', 'suite': 'Apt. 692',...      493-170-9623 x156   
4  {'street': 'Skiles Walks', 'suite': 'Suite 351...          (254)954-1289   

         website                                            company  
0  hildegard.org  {'name': 'Romaguera-Crona', 'catchPhrase': 'Mu
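
The `address` and `company` columns above hold nested dictionaries, not flat values. One way to flatten such fields is `pd.json_normalize`, sketched below on two partial records. Only the fields fully visible in the output above are included, so these records are deliberately incomplete.

```python
import pandas as pd

# Partial records: only values fully visible in the printed output are used.
records = [
    {"id": 1, "name": "Leanne Graham",
     "address": {"street": "Kulas Light", "suite": "Apt. 556"}},
    {"id": 2, "name": "Ervin Howell",
     "address": {"street": "Victor Plains"}},
]

# Nested keys become dotted column names such as "address.street".
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

Fields missing from a record, such as `suite` in the second one, come out as `NaN` in the flattened table.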