# Text Translation and Sentiment Analysis using Transformers

## Project Overview:

The objective of this project is to analyze the sentiment of movie reviews in three different languages - English, French, and Spanish. We have been given 30 movies, 10 in each language, along with their reviews and synopses in separate CSV files named `movie_reviews_eng.csv`, `movie_reviews_fr.csv`, and `movie_reviews_sp.csv`.

- The first step of this project is to convert the French and Spanish reviews and synopses into English. This will allow us to analyze the sentiment of all reviews in the same language. We will be using pre-trained transformers from HuggingFace to achieve this task.

- We will combine the three files into a single dataframe that contains all 30 movies along with their reviews, synopses, and year of release, with the French and Spanish text replaced by its English translation. This dataframe will be used to perform sentiment analysis on the reviews of each movie.

- Finally, we will use pre-trained transformers from HuggingFace to analyze the sentiment of each review. The sentiment analysis results will be added to the dataframe. The final dataframe will have 30 rows.


The output of the project will be a CSV file with a header row that includes the column names **Title**, **Year**, **Synopsis**, **Review**, **Original Language**, and **Sentiment**. The **Original Language** column will indicate the language of the review and synopsis (*eng/fr/sp*) before translation. The dataframe will consist of 30 rows, with each row corresponding to a movie.
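
The output described above can be checked with a small helper before exporting. This is a minimal sketch, not part of the original notebook: `check_output`, `EXPECTED_COLUMNS`, and `EXPECTED_LANGUAGES` are hypothetical names, and the column names and language codes are assumed to match the ones used by the code later in this notebook (`Sentiment`, and `eng`/`fr`/`sp`).

```python
from typing import List

import pandas as pd

# Assumed column names and language codes, taken from the code later in this notebook.
EXPECTED_COLUMNS = ["Title", "Year", "Synopsis", "Review", "Original Language", "Sentiment"]
EXPECTED_LANGUAGES = {"eng", "fr", "sp"}


def check_output(df: pd.DataFrame, expected_rows: int = 30) -> List[str]:
    """Return a list of human-readable problems; an empty list means the frame looks right."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if "Original Language" in df.columns:
        unknown = set(df["Original Language"].unique()) - EXPECTED_LANGUAGES
        if unknown:
            problems.append(f"unexpected language codes: {sorted(unknown)}")
    return problems
```

Calling `check_output(df)` on the final dataframe should return an empty list if the frame has the expected shape.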

In [2]:
# imports
import pandas as pd
from transformers import MarianMTModel, MarianTokenizer
from transformers import pipeline



### Get data from `.csv` files and then preprocess data

In [7]:
# use the `pd.read_csv()` function to read the movie_reviews_*.csv files into 3 separate pandas dataframes

# Note: The three files use different column names. For testing purposes
# the combined dataframe should have the following column names/headers -> [Title, Year, Synopsis, Review]

def preprocess_data() -> pd.DataFrame:
    """
    Reads movie data from the .csv files, maps the column names, adds the "Original Language" column,
    and finally concatenates everything into one resulting dataframe called "df".
    """
    # 1. read data from csv files
    df_eng = pd.read_csv("data/movie_reviews_eng.csv")
    df_fr = pd.read_csv("data/movie_reviews_fr.csv")
    df_sp = pd.read_csv("data/movie_reviews_sp.csv")
    
    # 2. rename columns names to be [Title, Year, Synopsis, Review]
    df_eng.rename(mapper= {'Movie / TV Series': 'Title',
                       'Year':'Year',
                      'Synopsis':'Synopsis',
                      'Review':'Review'},
              axis=1, 
              inplace=True)
    
    df_fr.rename(mapper= {'Titre': 'Title',
                       'Année':'Year',
                      'Synopsis':'Synopsis',
                      'Critiques':'Review'},
              axis=1, 
              inplace=True)
    
    df_sp.rename(mapper= {'Título': 'Title',
                       'Año':'Year',
                      'Sinopsis':'Synopsis',
                      'Críticas':'Review'},
              axis=1, 
              inplace=True)
    
    # 3. add "Original Language" column
    df_eng['Original Language'] = 'eng'
    df_fr['Original Language'] = 'fr'
    df_sp['Original Language'] = 'sp'
    
    # 4. combine dataframes into one
    df = pd.concat([df_eng, df_fr, df_sp], ignore_index=True)
    
    # return dataframe
    return df

df = preprocess_data()
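
The rename-tag-concatenate pattern used in `preprocess_data()` can also be driven by one mapping per language. The sketch below is an illustration only: it uses hypothetical one-row in-memory frames instead of the CSV files, and `COLUMN_MAPS` is a made-up name. Columns that already have the target name are simply left out of the mapping.

```python
import pandas as pd

# Hypothetical one-row stand-ins for the French and Spanish CSV files.
df_fr = pd.DataFrame({"Titre": ["Intouchables"], "Année": [2011],
                      "Synopsis": ["placeholder"], "Critiques": ["placeholder"]})
df_sp = pd.DataFrame({"Título": ["Roma"], "Año": [2018],
                      "Sinopsis": ["placeholder"], "Críticas": ["placeholder"]})

# One mapping per source language; columns already named correctly are omitted.
COLUMN_MAPS = {
    "fr": {"Titre": "Title", "Année": "Year", "Critiques": "Review"},
    "sp": {"Título": "Title", "Año": "Year", "Sinopsis": "Synopsis", "Críticas": "Review"},
}

frames = []
for lang, frame in {"fr": df_fr, "sp": df_sp}.items():
    # rename(columns=...) returns a new frame; assign() adds the language tag
    renamed = frame.rename(columns=COLUMN_MAPS[lang])
    frames.append(renamed.assign(**{"Original Language": lang}))

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined.columns.tolist())
```

Because `rename` and `assign` return new frames, the source frames stay untouched, which makes the cell safe to re-run.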

In [20]:
df.sample(10)

| | Title | Year | Synopsis | Review | Original Language |
|---|---|---|---|---|---|
| 29 | El Incidente | 2014 | En esta película de terror mexicana, un grupo ... | "El Incidente es una película aburrida y sin s... | sp |
| 15 | Le Dîner de Cons | 1998 | Le film suit l'histoire d'un groupe d'amis ric... | "Je n'ai pas aimé ce film du tout. Le concept ... | fr |
| 11 | Intouchables | 2011 | Ce film raconte l'histoire de l'amitié improba... | "Intouchables est un film incroyablement touch... | fr |
| 14 | Le Fabuleux Destin d'Amélie Poulain | 2001 | Cette comédie romantique raconte l'histoire d'... | "Le Fabuleux Destin d'Amélie Poulain est un fi... | fr |
| 19 | Babylon A.D. | 2008 | Dans un futur lointain, un mercenaire doit esc... | "Ce film est un gâchis complet. Les personnage... | fr |
| 6 | Scott Pilgrim vs. the World | 2010 | Scott Pilgrim (Michael Cera) must defeat his n... | "It was difficult to sit through the whole thi... | eng |
| 8 | Solo: A Star Wars Story | 2018 | A young Han Solo (Alden Ehrenreich) joins a gr... | "Dull and pointless, with none of the magic of... | eng |
| 4 | Inception | 2010 | Dom Cobb (Leonardo DiCaprio) is a skilled thie... | "Inception is a mind-bending and visually stun... | eng |
| 22 | Y tu mamá también | 2001 | Dos amigos adolescentes (Gael García Bernal y ... | "Y tu mamá también es una película que se qued... | sp |
| 9 | The Island | 2005 | In a future where people are cloned for organ ... | "The Island is a bland and forgettable sci-fi ... | eng |


### Text translation

Translate the **Review** and **Synopsis** column values to English.

In [21]:
# load translation models and tokenizers
fr_en_model_name = "Helsinki-NLP/opus-mt-fr-en"
es_en_model_name = "Helsinki-NLP/opus-mt-es-en"
fr_en_model = MarianMTModel.from_pretrained(fr_en_model_name)
es_en_model = MarianMTModel.from_pretrained(es_en_model_name)
fr_en_tokenizer = MarianTokenizer.from_pretrained(fr_en_model_name)
es_en_tokenizer = MarianTokenizer.from_pretrained(es_en_model_name)

def translate(text: str, model, tokenizer) -> str:
    """
    function to translate a text using a model and tokenizer
    """
    # encode the text using the tokenizer
    inputs = tokenizer([text], return_tensors="pt")

    # generate the translation using the model
    outputs = model.generate(**inputs)

    # decode the generated output and return the translated text
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    return decoded
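
The `translate` function above handles one text per call. A possible variant, sketched here under assumptions, translates several texts per call and guards against over-long inputs. `chunked`, `translate_batch`, and `batch_size` are hypothetical names, not part of the notebook; `padding=True` and `truncation=True` are standard tokenizer arguments, but whether truncating long reviews is acceptable for this dataset is a judgment call.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(texts: List[str], model, tokenizer, batch_size: int = 8) -> List[str]:
    """Translate many texts, a few at a time, preserving input order."""
    translated: List[str] = []
    for batch in chunked(texts, batch_size):
        # padding=True aligns texts of different lengths within the batch;
        # truncation=True clips inputs that exceed the model's maximum length
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        translated.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translated
```

With the objects loaded above, a call would look like `translate_batch(fr_reviews.tolist(), fr_en_model, fr_en_tokenizer)`.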



In [22]:
# test
translate("où est l'arrêt de bus ?", fr_en_model, fr_en_tokenizer)



"Where's the bus stop?"

In [23]:
# filter reviews in French and translate to English
fr_reviews = df.loc[df['Original Language'] == 'fr', 'Review']

fr_reviews_en = []
for review in fr_reviews:
    fr_reviews_en.append(translate(review, fr_en_model, fr_en_tokenizer))

fr_reviews_en  

['"The Land is an absolutely beautiful film with songs that stay in the head for days. The actors are incredible and their alchemy is palpable. The dance scenes are absolutely dazzling and the story is touching and authentic."',
 '"Untouchables is an incredibly touching film with incredible actors and an inspiring story. The jokes are smart and never offensive, and the emotion is perfectly dosed. It is a film that will make you laugh and cry, and that will remind you of the importance of friendship and compassion."',
 '"Amélie is an absolutely charming film that will make you smile from beginning to end. The aesthetics of the film is beautiful and imaginative, and the music is enchanting. Audrey Tautou is incredibly charismatic in the title role, and the story is full of touching moments and unforgettable characters."',
 '"The Choristes are a beautiful film that will make you laugh and cry. The music is absolutely moving and the performances are incredible, especially that of young act

In [24]:
# filter synopses in French and translate to English
fr_synopsis = df.loc[df['Original Language'] == 'fr', 'Synopsis']

fr_synopsis_en = []
for synopsis in fr_synopsis:
    fr_synopsis_en.append(translate(synopsis, fr_en_model, fr_en_tokenizer))

fr_synopsis_en

['This musical tells the story of a budding actress and jazz pianist who fall in love in Los Angeles. The film was hailed for its enchanting music, dazzling dance scenes and exceptional performances.',
 'This film tells the story of the unlikely friendship between a tetraplegic man and his home helper, a young man from the suburbs. The film was hailed for his intelligent humour, his moving performances and his universal message of friendship and understanding.',
 'This romantic comedy tells the story of Amélie, a shy and dreamy young woman who decides to change the lives of people around her. The film was hailed for her colorful and imaginative aesthetic, her enchanting original tape and the charismatic performance of Audrey Tautou.',
 'This film tells the story of a music teacher who tries to change the lives of the difficult students of a boarding school for children in difficulty. The film was hailed for his moving music, his dazzling performances and his ability to reach the heart 

In [25]:
# filter reviews in Spanish and translate to English
es_reviews = df.loc[df['Original Language'] == 'sp', 'Review']

es_reviews_en = []
for review in es_reviews:
    es_reviews_en.append(translate(review, es_en_model, es_en_tokenizer))

es_reviews_en

['"Rome is a beautiful and moving film that pays tribute to the lives of domestic workers in Mexico. Cinematography is impressive and Yalitza Aparicio\'s performance is exceptional."',
 '"The Paper House is an exciting and addictive series that will keep you on edge from start to finish. The characters are complex and well developed, and the plot is smart and surprising."',
 '"And your mom is also a movie that stays with you long after it ends. Alfonso Cuarón\'s direction is masterful, and the performance of the three protagonists is excellent."',
 '"The Labyrinth of Fauno is a fascinating and emotional film that combines the reality of the Spanish postwar period with elements of fantasy. Guillermo del Toro\'s direction is impressive, and Ivana Baquero\'s performance is moving."',
 '"Amores dogs is an intense and moving film that will keep you glued to the screen. Alejandro González Iñárritu\'s direction is masterful, and the actors\' performance is impressive."',
 '"Red Eagle is a bor

In [26]:
# filter synopses in Spanish and translate to English
es_synopsis = df.loc[df['Original Language'] == 'sp', 'Synopsis']

es_synopsis_en = []
for synopsis in es_synopsis:
    es_synopsis_en.append(translate(synopsis, es_en_model, es_en_tokenizer))

es_synopsis_en

['Cleo (Yalitza Aparicio) is a young domestic worker who works for a middle-class family in Mexico City during the 1970s. The film follows her daily life and relationships with her family and her community.',
 'This Spanish television series follows a group of thieves planning an ambitious robbery at the National Currency and Timbre Factory. The series became an international success and was broadcast worldwide via Netflix.',
 'Two teenage friends (Gael García Bernal and Diego Luna) embark on a road trip with an older woman (Maribel Verdú) they have just met. The film addresses issues such as friendship, sexuality and mortality.',
 'During the Spanish postwar period, Ofelia (Ivana Baquero) moves to a rural area with her mother and stepfather, a captain of the Francoist army. There, she discovers a fantastic world populated by magical creatures.',
 'Three stories intertwine in this Mexican film: a man who engages in dog fights, a model who suffers a car accident and a woman who abandons

In [27]:
len(fr_reviews), len(fr_reviews_en), len(fr_synopsis), len(fr_synopsis_en), len(es_reviews), len(es_reviews_en), len(es_synopsis), len(es_synopsis_en)

(10, 10, 10, 10, 10, 10, 10, 10)

In [28]:
# update dataframe with translated text
# add the translated reviews and synopsis - you can overwrite the existing data

# french: a boolean mask selects the rows to overwrite
is_french = df['Original Language'] == 'fr'
df.loc[is_french, 'Review'] = fr_reviews_en
df.loc[is_french, 'Synopsis'] = fr_synopsis_en

# spanish
is_spanish = df['Original Language'] == 'sp'
df.loc[is_spanish, 'Review'] = es_reviews_en
df.loc[is_spanish, 'Synopsis'] = es_synopsis_en
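
The filter, translate, and write-back steps above repeat the same pattern four times. One way to fold them into a single helper is sketched below. `translate_column` is a hypothetical name, and the demo uses `str.upper` as a stand-in translator so that it runs without downloading a model.

```python
from typing import Callable

import pandas as pd


def translate_column(df: pd.DataFrame, lang: str, column: str,
                     translate_fn: Callable[[str], str]) -> None:
    """Overwrite `column` in place for rows whose 'Original Language' equals `lang`."""
    mask = df["Original Language"] == lang
    # map() keeps the original index, so the assignment aligns row by row
    df.loc[mask, column] = df.loc[mask, column].map(translate_fn)


# Toy data and a stand-in translator (str.upper) so the sketch runs without a model.
demo = pd.DataFrame({
    "Review": ["good film", "bon film", "buena película"],
    "Original Language": ["eng", "fr", "sp"],
})
translate_column(demo, "fr", "Review", str.upper)
print(demo["Review"].tolist())
```

In the notebook, the stand-in would be replaced by something like `lambda text: translate(text, fr_en_model, fr_en_tokenizer)`. Because the result is aligned by index, no separate length check on the translated lists is needed.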

In [29]:
df.sample(10)

| | Title | Year | Synopsis | Review | Original Language |
|---|---|---|---|---|---|
| 11 | Intouchables | 2011 | This film tells the story of the unlikely frie... | "Untouchables is an incredibly touching film w... | fr |
| 10 | La La Land | 2016 | This musical tells the story of a budding actr... | "The Land is an absolutely beautiful film with... | fr |
| 1 | The Dark Knight | 2008 | Batman (Christian Bale) teams up with District... | "The Dark Knight is a thrilling and intense su... | eng |
| 27 | El Bar | 2017 | A group of people are trapped in a bar after M... | "The Bar is a ridiculous and meaningless film ... | sp |
| 12 | Amélie | 2001 | This romantic comedy tells the story of Amélie... | "Amélie is an absolutely charming film that wi... | fr |
| 18 | Les Visiteurs en Amérique | 2000 | In this continuation of the French comedy The ... | "The film is a total waste of time. The jokes ... | fr |
| 14 | Le Fabuleux Destin d'Amélie Poulain | 2001 | This romantic comedy tells the story of Amélie... | "The Fabulous Destiny of Amélie Poulain is an ... | fr |
| 24 | Amores perros | 2000 | Three stories intertwine in this Mexican film:... | "Amores dogs is an intense and moving film tha... | sp |
| 29 | El Incidente | 2014 | In this Mexican horror film, a group of people... | "The Incident is a boring and frightless film ... | sp |
| 20 | Roma | 2018 | Cleo (Yalitza Aparicio) is a young domestic wo... | "Rome is a beautiful and moving film that pays... | sp |


### Sentiment Analysis

Use a pre-trained HuggingFace model to analyze the sentiment of the reviews. Store each result (`POSITIVE` or `NEGATIVE`, the labels the model returns) in a new column titled **Sentiment** in the dataframe.

In [30]:
# load sentiment analysis pipeline
# "text-classification" is a pipeline task name, not a model name
task_name = "text-classification"
sentiment_classifier = pipeline(task_name)

def analyze_sentiment(text, classifier):
    """
    function to perform sentiment analysis on a text using a model
    """
    return classifier(text)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
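
The warning above recommends naming the model and revision explicitly. A minimal sketch of that, assuming the model name and revision printed in the warning are the ones to pin; `load_sentiment_classifier`, `SENTIMENT_MODEL`, and `SENTIMENT_REVISION` are hypothetical names:

```python
# Model name and revision are copied from the warning printed above.
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
SENTIMENT_REVISION = "af0f99b"


def load_sentiment_classifier():
    """Build the classifier with an explicit model and revision (downloads weights on first use)."""
    # imported lazily so the constants can be reused without loading transformers
    from transformers import pipeline

    return pipeline("text-classification", model=SENTIMENT_MODEL, revision=SENTIMENT_REVISION)
```

Pinning both values keeps results reproducible if the library's default model for the task changes in a later release.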


In [31]:
# test
analyze_sentiment("This restaurant is awesome", sentiment_classifier)

[{'label': 'POSITIVE', 'score': 0.9998743534088135}]

In [40]:
# perform sentiment analysis on reviews and store results in new column
reviews_sentiment_analysis = analyze_sentiment(df['Review'].tolist(), sentiment_classifier)
reviews_sentiment_analysis_df = pd.DataFrame(reviews_sentiment_analysis, columns=['label', 'score'])
df['Sentiment'] = reviews_sentiment_analysis_df['label']
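
The classifier returns one dict per review with `label` and `score` keys, as the test call above shows. The sketch below illustrates one way to attach both to a dataframe. The `results` list is hypothetical data in that shape, and the `Sentiment Score` column and the title-cased labels are assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical classifier output in the same shape as the test call above:
# one dict per input text, with 'label' and 'score' keys.
results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.97},
]
demo = pd.DataFrame({"Review": ["great", "awful"]})

# Reusing demo.index keeps the results aligned even if the frame's index is not 0..n-1.
labels = pd.DataFrame(results, index=demo.index)
demo["Sentiment"] = labels["label"].str.capitalize()  # 'POSITIVE' -> 'Positive'
demo["Sentiment Score"] = labels["score"].round(4)
print(demo)
```

Keeping the score next to the label makes it easy to spot low-confidence predictions, which may be worth a manual look for reviews that went through machine translation.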

In [41]:
df.sample(10)

| | Title | Year | Synopsis | Review | Original Language | Sentiment |
|---|---|---|---|---|---|---|
| 2 | Forrest Gump | 1994 | Forrest Gump (Tom Hanks) is a simple man with ... | "Forrest Gump is a heartwarming and inspiratio... | eng | POSITIVE |
| 4 | Inception | 2010 | Dom Cobb (Leonardo DiCaprio) is a skilled thie... | "Inception is a mind-bending and visually stun... | eng | POSITIVE |
| 1 | The Dark Knight | 2008 | Batman (Christian Bale) teams up with District... | "The Dark Knight is a thrilling and intense su... | eng | POSITIVE |
| 27 | El Bar | 2017 | A group of people are trapped in a bar after M... | "The Bar is a ridiculous and meaningless film ... | sp | NEGATIVE |
| 0 | The Shawshank Redemption | 1994 | Andy Dufresne (Tim Robbins), a successful bank... | "The Shawshank Redemption is an inspiring tale... | eng | POSITIVE |
| 24 | Amores perros | 2000 | Three stories intertwine in this Mexican film:... | "Amores dogs is an intense and moving film tha... | sp | POSITIVE |
| 9 | The Island | 2005 | In a future where people are cloned for organ ... | "The Island is a bland and forgettable sci-fi ... | eng | NEGATIVE |
| 6 | Scott Pilgrim vs. the World | 2010 | Scott Pilgrim (Michael Cera) must defeat his n... | "It was difficult to sit through the whole thi... | eng | NEGATIVE |
| 28 | Torrente: El brazo tonto de la ley | 1998 | In this Spanish comedy, a corrupt cop (played ... | "Torrente is a vulgar and offensive film that ... | sp | NEGATIVE |
| 12 | Amélie | 2001 | This romantic comedy tells the story of Amélie... | "Amélie is an absolutely charming film that wi... | fr | POSITIVE |


In [43]:
import os

outdir = './result'
# create the output folder; exist_ok=True makes the cell safe to re-run
os.makedirs(outdir, exist_ok=True)

In [44]:
# export the results to a .csv file
df.to_csv("result/reviews_with_sentiment.csv", index=False)
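
After exporting, reading the file back is a quick way to confirm the header row and row count. The sketch below runs the round trip on a hypothetical one-row frame in a temporary directory, so it does not touch the real `result` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row frame standing in for the final dataframe.
demo = pd.DataFrame({"Title": ["Roma"], "Year": [2018], "Sentiment": ["POSITIVE"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "result" / "reviews_with_sentiment.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    demo.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist(), reloaded.shape)
```

For the real export, the same check would expect 30 rows and the six column names described in the project overview.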