# Data Processing Complete Notebook

This notebook is made to do the complete data processing of our data. It will mostly cover the cleaning process and merging process of the datasets. This notebook will cover the following :

- Data Cleaning: Handling missing values, removing duplicates, and correcting data types.

- Data Integration: Merging datasets, joining tables, and aggregating data.

With this done, we will be able to use our two haikus datasets, the french and the english dataset, to train our model after treating them for our training. 

In [1]:
import pandas as pd
from langdetect import detect
import numpy as np

## Temps libre loading

In this section, we are processing data from the Temps Libre website, which I scraped using my custom code. 
The code used for scraping can be found at the following link: [Scraping Code](insert-your-link-here)

This step involves cleaning and preparing the haiku data for further analysis and use in our project.

This website was found in the list of this GitHub during my research about the subject : [haiku-scraper's GitHub](https://github.com/ytixu/haiku-scraper/tree/master).

This is the website that has be scrapped to get this data : [Temps libre](https://www.tempslibres.org/tl/tlphp/dblang.php?lg=e). 

In [2]:
# Load the file
df_tl_en = pd.read_csv('database/raw_data/english_tempslibre_csv.csv', encoding='utf8')

print(f"Dataset lenght : {len(df_tl_en)}")
df_tl_en.head()

Dataset lenght : 5664


Unnamed: 0,line_1,line_2,line_3,source
0,zen garden --,stirring a passing cloud,in my tea,tempslibre
1,mosquee blackout,I come home,with new slippers,tempslibre
2,fishing boats,colors of,the rainbow,tempslibre
3,ash wednesday--,trying to remember,my dream,tempslibre
4,snowy morn--,pouring another cup,of black coffee,tempslibre


In [3]:
# Load the file
df_tl_fr_1 = pd.read_csv('database/raw_data/french_tempslibre_csv.csv', encoding='utf8')
df_tl_fr_2 = pd.read_csv('database/raw_data/french_2_tempslibre_csv.csv', encoding='utf8')

# Concat the two french csv into one
dfs_tl_fr_1_2 = [df_tl_fr_1, df_tl_fr_2]
df_tl_fr = pd.concat(dfs_tl_fr_1_2)

print(f"Dataset lenght : {len(df_tl_fr)}")
df_tl_fr.head()

Dataset lenght : 9942


Unnamed: 0,line_1,line_2,line_3,source
0,Sur une colonne en bois,le vendeur des ballons,hisse ses souffles,tempslibre
1,Nuit froide marchant seule,d'une maison une odeur,de beignets au sucre,tempslibre
2,brume matinale,dans la tasse de café,la buée de l’arome,tempslibre
3,averse au jardin,elle a fait naufrage,la tarte aux pommes,tempslibre
4,sécheresse,dans l’eau vaseuse,un ibis blanc,tempslibre


## Herons Nest loading

In this section, we are processing data from the Herons Nest website, which I scraped using my custom code. 
The code used for scraping can be found at the following link: [Scraping Code](insert-your-link-here)

This step involves cleaning and preparing the haiku data for further analysis and use in our project.

This website was found in the list of this GitHub during my research about the subject : [haiku-scraper's GitHub](https://github.com/ytixu/haiku-scraper/tree/master).

This is the website that has be scrapped to get this data : [Herons Nest](https://theheronsnest.com/June2024/index.html). 

In [4]:
# Load the file
df_hn_en = pd.read_csv('database/raw_data/heronsnest_csv.csv', encoding='utf8')

print(f"Dataset lenght : {len(df_hn_en)}")
df_hn_en.head()

Dataset lenght : 1925


Unnamed: 0,line_1,line_2,line_3,source
0,early spring,a hint of green,in your eyes,herons_nest
1,new home,morning birdsong,in a different key,herons_nest
2,quilting squares...,the repeating patterns,of a life,herons_nest
3,putting away,my mother's letters,tulips in bloom,herons_nest
4,bullet train,just a glimpse,of the rice farmer,herons_nest


## Modern Haikus Loading

In this section, we are processing data from the Modern Haikus website, which I scraped using my custom code. 
The code used for scraping can be found at the following link: [Scraping Code](insert-your-link-here)

This step involves cleaning and preparing the haiku data for further analysis and use in our project.

This website was found in the list of this GitHub during my research about the subject : [haiku-scraper's GitHub](https://github.com/ytixu/haiku-scraper/tree/master).

This is the website that has be scrapped to get this data : [Moderns Haikus](http://www.modernhaiku.org/previousissue.html). 

In [5]:
# Load the file
df_mh_en = pd.read_csv('database/raw_data/modern_haikus_csv.csv', encoding='utf8')

print(f"Dataset lenght : {len(df_mh_en)}")
df_mh_en.head()

Dataset lenght : 369


Unnamed: 0,line_1,line_2,line_3,source
0,voicemail —,the immediacy,of winter rain,modern_haikus
1,wind chimes,the laughter of grandchildren,all moved away,modern_haikus
2,highschool,trophy case —,her maiden name,modern_haikus
3,holding our breaths,releasing them,humpbacks,modern_haikus
4,Groundhog Day,motorcyclists see,their shadows,modern_haikus


## Github Haikuzao Txt Data Loading

In this section, we are processing data from a GitHub repository that I found during my research. This data will be treated and prepared for our usage.

As observed, there are too few French haikus (and some of them are actually in English), so we will not keep them.

Source: [Haikuzao's Github](https://github.com/herval/creative_machines/tree/master/haikuzao) 

In [6]:
with open('database/raw_data/haikuzao_data.txt', 'r') as f:
    text = f.read()

df_hg = [i for i in text.split('\n\n') if len(i.split('\n')) == 3]

df_hg_en_list = []
df_hg_fr_list = []

for h in df_hg :
    h_splitted = h.split('\n')
    h_str = h_splitted[0] + " " + h_splitted[1] + " " + h_splitted[2]
    if detect(h_str) == "en" :
        df_hg_en_list.append(h_splitted)
    elif detect(h_str) == "fr" :
        df_hg_fr_list.append(h_splitted)

cols_names = ["line_1", "line_2", "line_3"]
df_hg_eng = pd.DataFrame(df_hg_en_list, columns=cols_names)
df_hg_fr = pd.DataFrame(df_hg_fr_list, columns=cols_names)

print(f"English dataset lenght : {len(df_hg_eng)}")
print(f"French dataset lenght : {len(df_hg_fr)}")

df_hg_eng['source'] = 'haikuzao'
df_hg_eng.head()

English dataset lenght : 4725
French dataset lenght : 14


Unnamed: 0,line_1,line_2,line_3,source
0,a skein of birds,twines across the sky,the northbound train departs,haikuzao
1,dawn chorus begins,I reach for,the snooze button,haikuzao
2,new March snow,the grouse with a missing toe,still around,haikuzao
3,Remembrance Day-,even the traffic,pauses for 2 minutes,haikuzao
4,dignified march-,veterans and peacekeepers,pass the applause,haikuzao


## Kaggle Haikus dataset Loading

In this section, we are processing data from a Kaggle that I found during my research. This data will be treated and prepared for our usage.

Source: [Kaggle's Page](https://www.kaggle.com/datasets/hjhalani30/haiku-dataset?resource=download) 

In [7]:
# Load the CSV file
df_kaggle = pd.read_csv('database/raw_data/all_haiku_kaggle.csv')

df_kaggle = df_kaggle.drop(columns=['Unnamed: 0', 'hash'])
df_kaggle.columns = ['line_1', 'line_2', 'line_3', 'source']

# Display the first 5 lines
print(f"Dataset lenght : {len(df_kaggle)}")
df_kaggle.head()

Dataset lenght : 144123


Unnamed: 0,line_1,line_2,line_3,source
0,fishing boats,colors of,the rainbow,tempslibres
1,ash wednesday--,trying to remember,my dream,tempslibres
2,snowy morn--,pouring another cup,of black coffee,tempslibres
3,shortest day,flames dance,in the oven,tempslibres
4,haze,half the horse hidden,behind the house,tempslibres


## Creating the Final Two CSV Files (English and French)

In this section, we will combine and save all of our datasets into two comprehensive CSV files—one for English and one for French.

This approach will enable us to reuse the data later without needing to process all of our datasets again.

In [8]:
final_french_dataset = df_tl_fr # For now, we only have this one as the others are in english of too small and mixed with english
final_english_dataset = pd.concat([df_tl_en, df_hn_en, df_mh_en, df_hg_eng, df_kaggle])

print(f"Final english dataset lenght : {len(final_english_dataset)}")
print(f"Final french dataset lenght : {len(final_french_dataset)}")

print(final_english_dataset.head())
print(final_french_dataset.head())

final_english_dataset.to_csv('database/final_data/final_english_dataset.csv', index=False)
final_french_dataset.to_csv('database/final_data/final_french_dataset.csv', index=False)

Final english dataset lenght : 156806
Final french dataset lenght : 9942
             line_1                    line_2             line_3      source
0     zen garden --  stirring a passing cloud          in my tea  tempslibre
1  mosquee blackout               I come home  with new slippers  tempslibre
2     fishing boats                 colors of        the rainbow  tempslibre
3   ash wednesday--       trying to remember            my dream  tempslibre
4      snowy morn--       pouring another cup    of black coffee  tempslibre
                       line_1                  line_2                 line_3  \
0     Sur une colonne en bois  le vendeur des ballons    hisse ses souffles    
1  Nuit froide marchant seule  d'une maison une odeur  de beignets au sucre    
2              brume matinale   dans la tasse de café    la buée de l’arome    
3            averse au jardin    elle a fait naufrage   la tarte aux pommes    
4                  sécheresse      dans l’eau vaseuse         un 

This may not be a lot of data, but we will work with it for now. Some websites that I tried to scrape were not easy to work with and will need to be reviewed later to add more data to the project.

However, to further enhance the project, it would be beneficial to collect more data from multiple sources in both French and English.

By doing so, we can improve the diversity and quality of the haikus, leading to better training for our models and generating more accurate results.

Additionally, this will give me a chance to get even better at web scraping and learn more about this field.

## Creating hashes for our haikus 

Creating hashes for our Haikus allows us to efficiently identify and remove duplicate entries. 

By generating a unique hash for each Haiku based on its content, we can easily compare and filter out duplicates, ensuring that our dataset is clean and contains only unique Haikus. 

This will allow us to check our dataset and ensure a better training of the models latter on.

In [9]:
import pandas as pd

# Read the CSV files
english_haikus = pd.read_csv('database/final_data/final_english_dataset.csv')
french_haikus = pd.read_csv('database/final_data/final_french_dataset.csv')

print(f"English dataset lenght before duplicates dropping : {len(english_haikus)}")
print(f"French dataset lenght before duplicates dropping : {len(french_haikus)}")

# Process English Haikus
# TODO : check the hashes generation to avoid punctuation and spaces in the hash
english_haikus['hash'] = (english_haikus['line_1'] + english_haikus['line_2'] + english_haikus['line_3']).str.replace(r'[^A-Za-z]', '').str.upper()
english_haikus = english_haikus.drop_duplicates(subset=['hash'])
english_haikus = english_haikus.drop(columns=['hash'])

# Process French Haikus
french_haikus['hash'] = (french_haikus['line_1'] + french_haikus['line_2'] + french_haikus['line_3']).str.replace(r'[^A-Za-z]', '').str.upper()
french_haikus = french_haikus.drop_duplicates(subset=['hash'])
french_haikus = french_haikus.drop(columns=['hash'])

# Display the processed DataFrames
len(english_haikus), len(french_haikus)

English dataset lenght before duplicates dropping : 156806
French dataset lenght before duplicates dropping : 9942


(149193, 8364)

## Datasets verifications

Here, we are detecting and handling problematic lines to ensure the integrity of our Haiku datasets. 

By identifying lines that are not strings or contain NaN values, we can drop them from our data and by doing so, prevent them from affecting our treatment and training. 

Dropping rows with NaN values and detecting problematic lines helps maintain a clean dataset, which is crucial for the next steps of the project.

In [10]:
# Function to detect problematic lines
def detect_problematic_lines(df):
    problematic_lines = []
    for i in range(1, 4):
        for index, value in df[f'line_{i}'].items():
            if not isinstance(value, str) or pd.isna(value):
                print(f"Line {i} of haiku {index} is not a string or is NaN")
                print(value)
                problematic_lines.append((index, f'line_{i}', value))
    return problematic_lines

# Drop rows with NaN values in 'line_1', 'line_2', or 'line_3'
english_haikus = english_haikus.dropna(subset=['line_1', 'line_2', 'line_3'])
french_haikus = french_haikus.dropna(subset=['line_1', 'line_2', 'line_3'])

# Detect problematic lines
english_problematic_lines = detect_problematic_lines(english_haikus)
french_problematic_lines = detect_problematic_lines(french_haikus)

# Display the problematic lines
print(english_problematic_lines)
print(french_problematic_lines)

[]
[]


## Sylabus counting

Counting syllables in Haikus is essential to ensure they adhere to the traditional 5-7-5 syllable structure and to test some new things with our models later on by tweaking those values. 

Even if here, they do not all have this specific structure, it will help the model to understand the format of the data that it is training on and that it will have to generate after.

This step is quite important for the treatment of our data and to make it understandable for our models.

In [11]:
import pandas as pd
import syllables

# Function to count syllables in each line and sum them
def count_and_sum_syllables(df):
    df['line1_syllables'] = df['line_1'].apply(syllables.estimate)
    df['line2_syllables'] = df['line_2'].apply(syllables.estimate)
    df['line3_syllables'] = df['line_3'].apply(syllables.estimate)
    return df

# Drop rows with NaN values in 'line_1', 'line_2', or 'line_3'
english_haikus = english_haikus.dropna(subset=['line_1', 'line_2', 'line_3'])
french_haikus = french_haikus.dropna(subset=['line_1', 'line_2', 'line_3'])

# Count and sum syllables
english_haikus = count_and_sum_syllables(english_haikus)
french_haikus = count_and_sum_syllables(french_haikus)

# Display the processed DataFrames
print(english_haikus.head())
print(french_haikus.head())

             line_1                    line_2             line_3      source  \
0     zen garden --  stirring a passing cloud          in my tea  tempslibre   
1  mosquee blackout               I come home  with new slippers  tempslibre   
2     fishing boats                 colors of        the rainbow  tempslibre   
3   ash wednesday--       trying to remember            my dream  tempslibre   
4      snowy morn--       pouring another cup    of black coffee  tempslibre   

   line1_syllables  line2_syllables  line3_syllables  
0                3                6                3  
1                4                5                4  
2                3                3                3  
3                4                5                2  
4                3                6                4  
                       line_1                  line_2                 line_3  \
0     Sur une colonne en bois  le vendeur des ballons    hisse ses souffles    
1  Nuit froide marchant seule

In [12]:
# Function to process a dataset
def final_process_haikus(df):
    # Compute the mean of syllables
    mean_syllables = np.mean(df[['line1_syllables', 'line2_syllables', 'line3_syllables']], axis=1)

    # Define the acceptable range of syllables
    lower_bound = 3
    upper_bound = 9

    # Filter out haikus with syllables outside the acceptable range
    filtered_haikus = df[(mean_syllables >= lower_bound) & (mean_syllables <= upper_bound)]

    # Print the new length of the dataset
    print(f"New length of dataset: {len(filtered_haikus)}")

    # Check for NaN values
    nan_rows = filtered_haikus[filtered_haikus.isna().any(axis=1)]
    print(f"Number of rows with NaN values: {len(nan_rows)}")

    return filtered_haikus

In [13]:
# Process both datasets
filtered_english_haikus = final_process_haikus(english_haikus)
filtered_french_haikus = final_process_haikus(french_haikus)

New length of dataset: 140148
Number of rows with NaN values: 0
New length of dataset: 8300
Number of rows with NaN values: 0


In [14]:
filtered_english_haikus.to_csv('database/final_data/final_english_dataset.csv', index=False)
filtered_french_haikus.to_csv('database/final_data/final_french_dataset.csv', index=False)