### Building Dataset and Text Pre-Processing

In this section we are trying to achieve the following:

1. Making sure the dataset contains only English movies and English text

This step goes through the file movie_lines.txt, which contains the script lines of all the movies, and checks that it has no non-English alphabetic characters, while still allowing symbols, punctuation, numbers, etc.

2. Collecting the features list

In this step, we have the following dataset files:

- movie_characters_metadata.txt
- movie_conversations.txt
- movie_lines.txt
- movie_titles_metadata.txt
- raw_script_urls.txt

Description and content of each file:

#### movie_titles_metadata.txt

Content: Information about each movie title

Field Separator: " +++$+++ "

Fields:

- movieID: format of mx where (x) is the movie index. 
- movie title: one or more words and digits representing the movie title
- movie year: digits of the movie release year
- IMDB rating: the movie's IMDB rating
- number of IMDB votes: how many IMDB votes the movie received
- genres: list of one or more genres
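
As a small illustration of how the field separator works, the sketch below splits one line into the fields listed above. The sample line is rebuilt from the first row of the movie titles preview shown later in this section, and `field_names` reuses the attribute names this section adopts; the variable names themselves are only illustrative.

```python
# Field separator shared by the dataset files.
SEPARATOR = " +++$+++ "

# Sample line rebuilt from the first row of the movie titles preview.
sample_line = (
    "m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90"
    " +++$+++ 62847 +++$+++ ['comedy', 'romance']\n"
)

# Attribute names used for the movie title fields in this section.
field_names = ["movie_id", "movie_title", "movie_year",
               "IMDB_rating", "IMDB_votes", "genres"]

# Drop the trailing newline, split on the separator, and pair values with names.
values = sample_line.rstrip("\n").split(SEPARATOR)
record = dict(zip(field_names, values))

print(record["movie_title"])
print(record["genres"])
```

Every value comes back as a string, including the year, the rating, and the genre list.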

#### movie_characters_metadata.txt

Content: Contains information about each movie character

Field Separator: " +++$+++ "

Fields:

- characterID: format of ux where (x) is the character index.
- character name: one or more words giving the character's first and last name
- movieID: format of mx where (x) is the movie index.
- movie title: one or more words and digits representing the movie title
- gender: one of f (female), m (male), or ? (unlabeled)
- position in credits: position of character in credits
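
Since unlabeled values are stored as "?", it may be convenient to turn them into proper missing values once the data is in a DataFrame. The following is only a sketch on a toy table (the three rows mirror the character preview shown later in this section); whether and where to apply such a step in the pipeline is an assumption, not something the notebook does.

```python
import pandas as pd

# Toy table with the "?" placeholder in both columns.
characters = pd.DataFrame({
    "character_id": ["u0", "u1", "u2"],
    "gender": ["f", "?", "m"],
    "position_credits": ["4", "?", "3"],
})

# Replace the "?" placeholder in gender with a missing value.
characters["gender"] = characters["gender"].mask(characters["gender"] == "?")

# Non-numeric entries such as "?" become missing; Int64 keeps the rest as integers.
characters["position_credits"] = pd.to_numeric(
    characters["position_credits"], errors="coerce"
).astype("Int64")

print(characters)
```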

#### movie_lines.txt

Content: Contains the actual text of each utterance

Field Separator: " +++$+++ "

Fields:

- lineID: format of Lxxxx where (xxxx) is the line index
- characterID: the character who uttered this line (format of ux where (x) is the character index)
- movieID: the movie in which this line was spoken
- character name: name of the character who uttered the line (one or more words giving the character's first and last name)
- text of the utterance: the actual line of the script

#### movie_conversations.txt

Content: The structure of the conversations

Field Separator: " +++$+++ "

Fields:

- characterID: The first character involved in the conversation (character ID format)
- characterID: The second character involved in the conversation (character ID format)
- movieID: The movie in which the conversation occurred (movieID format)
- list of the utterances that make up the conversation, in chronological order: ['lineID1','lineID2',…,'lineIDN'] (list of lineID format).
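
The utterance list is stored as text that looks like a Python list, so it has to be parsed before the line IDs can be used individually. A minimal sketch, assuming the field is always a well-formed list literal (the sample value is the first row of the conversations preview shown later in this section; the pairing step is just one illustrative use of the chronological order):

```python
import ast

# Fourth field of a conversation row, as it is stored in the file.
raw_utterances = "['L194', 'L195', 'L196', 'L197']"

# literal_eval only accepts Python literals, which makes it safer than eval().
line_ids = ast.literal_eval(raw_utterances)

# Consecutive pairs follow the chronological order of the conversation.
turn_pairs = list(zip(line_ids, line_ids[1:]))

print(line_ids)
print(turn_pairs)
```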

#### raw_script_urls.txt

Content: The URLs from which the raw scripts were retrieved

Field Separator: " +++$+++ "

Fields:

- movieID: format of mx where (x) is the movie index
- movie title: one or more words and digits representing the movie title
- URL: the URL from which the raw source was retrieved

#### Features List

We will be using the following attribute names for each of the features. 

- **movie_id:** format of mx where (x) is the movie index.
- **movie_title:** one or more words and digits representing the movie title
- **movie_year:** digits of the movie release year
- **IMDB_rating:** the movie's IMDB rating
- **IMDB_votes:** how many IMDB votes the movie received
- **genres:** list of one or more genres
- **character_id:** format of ux where (x) is the character index.
- **character_name:** one or more words giving the character's first and last name
- **gender:** one of f (female), m (male), or ? (unlabeled)
- **position_credits:** position of character in credits
- **line_id:** format of Lxxxx where (xxxx) is the line index
- **line_text:** the actual line of the script
- **character_id1_conv:** The first character involved in the conversation (character ID format)
- **character_id2_conv:** The second character involved in the conversation (character ID format)
- **line_order_conv:** list of the utterances that make up the conversation, in chronological order: ['lineID1','lineID2',…,'lineIDN'] (list of lineID format).


In [1]:
import re
import os

def has_non_english_letters(filename):
    """Return True if the file contains any alphabetic character outside ASCII."""
    # Note: errors='ignore' silently drops bytes that are not valid UTF-8,
    # so letters stored in another encoding will not be detected.
    with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
        text = file.read()
    # Pattern for runs of non-ASCII characters
    non_ascii_pattern = re.compile(r'[^\x00-\x7F]+')
    for match in non_ascii_pattern.finditer(text):
        # [^\W\d_] matches letters only, so non-ASCII symbols and digits are ignored
        if re.search(r'[^\W\d_]', match.group()):
            return True
    return False

# Variable to store the directory path
file_path = 'archive/'  # Modify this to the path where your files are located

# List of files to check
files_to_check = [
    'movie_characters_metadata.txt',
    'movie_conversations.txt',
    'movie_lines.txt',
    'movie_titles_metadata.txt'
]

# Iterate over each file and check for non-English letters
for filename in files_to_check:
    full_path = os.path.join(file_path, filename)  # Combine the file path with the filename
    if os.path.exists(full_path):
        if has_non_english_letters(full_path):
            print(f"{filename}: Non-English characters found.")
        else:
            print(f"{filename}: Only English characters found.")
    else:
        print(f"{filename}: File does not exist.")


movie_characters_metadata.txt: Only English characters found.
movie_conversations.txt: Only English characters found.
movie_lines.txt: Only English characters found.
movie_titles_metadata.txt: Only English characters found.
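
The check above only answers yes or no per file. To see which lines and letters would trigger it, a line-by-line variant can help. The sketch below is illustrative only: `find_non_ascii_letters` and the sample lines are hypothetical and not part of the dataset. To apply it to a real file, pass an open file handle instead of the list, keeping in mind that the result depends on how the file is decoded.

```python
def find_non_ascii_letters(lines):
    """Yield (line_number, letters) for each line containing non-ASCII letters."""
    for line_number, line in enumerate(lines, start=1):
        # Keep characters that are alphabetic but outside the ASCII range.
        letters = [ch for ch in line if ch.isalpha() and not ch.isascii()]
        if letters:
            yield line_number, letters


# Hypothetical sample lines, not taken from the dataset.
sample_lines = [
    "They do not!",
    "Caf\u00e9 at 9 -- okay?",
    "Price: 5\u20ac",  # the euro sign is a symbol, not a letter
]

for line_number, letters in find_non_ascii_letters(sample_lines):
    print(line_number, letters)
```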


Processing the txt dataset files into CSV files and matching the attributes with the columns described earlier, for easier use in later parts of the code.
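
The conversion follows the same pattern for every file, so it can also be written once as a helper. The sketch below is one possible way to do that, not the code this notebook runs: `txt_to_dataframe` is a hypothetical name, and the demonstration writes two rows (taken from the movie lines preview in this section) to a temporary file. It uses `rstrip("\n")` rather than `strip()` so that a row ending in an empty field keeps its trailing separator, and it raises an error when a row has an unexpected number of fields.

```python
import os
import tempfile

import pandas as pd

SEPARATOR = " +++$+++ "


def txt_to_dataframe(input_file, columns, encoding="utf-8"):
    """Read a ' +++$+++ '-separated text file into a DataFrame of strings."""
    rows = []
    with open(input_file, "r", encoding=encoding, errors="ignore") as file:
        for line_number, line in enumerate(file, start=1):
            fields = line.rstrip("\n").split(SEPARATOR)
            # Fail early on malformed rows instead of building a ragged table.
            if len(fields) != len(columns):
                raise ValueError(
                    f"{input_file}, line {line_number}: expected "
                    f"{len(columns)} fields, found {len(fields)}"
                )
            rows.append(fields)
    return pd.DataFrame(rows, columns=columns)


# Demonstration on a temporary file holding two rows shaped like movie_lines.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_lines.txt")
    with open(sample_path, "w", encoding="utf-8") as file:
        file.write("L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n")
        file.write("L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!\n")

    df = txt_to_dataframe(
        sample_path,
        ["line_id", "character_id", "movie_id", "character_name", "line_text"],
    )
    print(df)
```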
 

In [2]:
import pandas as pd

# Define the input file and output CSV file names
input_file = 'archive/movie_characters_metadata.txt'
output_file = 'clean_dataset/movie_characters_metadata.csv'

# Define the column names
columns = ['character_id', 'character_name', 'movie_id', 'movie_name', 'gender', 'position_credits']

# Read the text file and split it based on the provided separator
data = []
with open(input_file, 'r', encoding='utf-8', errors='ignore') as file:
    for line in file:
        # Split each line using the separator ' +++$+++ '
        split_line = line.strip().split(' +++$+++ ')
        data.append(split_line)

# Create a pandas DataFrame
df = pd.DataFrame(data, columns=columns)

# Display the first 10 rows of the dataframe
print(df.head(10))

# Save the dataframe to a CSV file
df.to_csv(output_file, index=False)

print(f"CSV file saved as {output_file}")


  character_id character_name movie_id                  movie_name gender  \
0           u0         BIANCA       m0  10 things i hate about you      f   
1           u1          BRUCE       m0  10 things i hate about you      ?   
2           u2        CAMERON       m0  10 things i hate about you      m   
3           u3       CHASTITY       m0  10 things i hate about you      ?   
4           u4           JOEY       m0  10 things i hate about you      m   
5           u5            KAT       m0  10 things i hate about you      f   
6           u6       MANDELLA       m0  10 things i hate about you      f   
7           u7        MICHAEL       m0  10 things i hate about you      m   
8           u8     MISS PERKY       m0  10 things i hate about you      ?   
9           u9        PATRICK       m0  10 things i hate about you      m   

  position_credits  
0                4  
1                ?  
2                3  
3                ?  
4                6  
5                2  
6    

In [3]:
import pandas as pd

# Define the input file and output CSV file names
input_file = 'archive/movie_conversations.txt'
output_file = 'clean_dataset/movie_conversations.csv'

# Define the column names
columns = ['character_id1_conv', 'character_id2_conv', 'movie_id', 'line_order_conv']

# Read the text file and split it based on the provided separator
data = []
with open(input_file, 'r', encoding='utf-8', errors='ignore') as file:
    for line in file:
        # Split each line using the separator ' +++$+++ '
        split_line = line.strip().split(' +++$+++ ')
        data.append(split_line)

# Create a pandas DataFrame
df = pd.DataFrame(data, columns=columns)

# Display the first 10 rows of the dataframe
print(df.head(10))

# Save the dataframe to a CSV file
df.to_csv(output_file, index=False)

print(f"CSV file saved as {output_file}")


  character_id1_conv character_id2_conv movie_id  \
0                 u0                 u2       m0   
1                 u0                 u2       m0   
2                 u0                 u2       m0   
3                 u0                 u2       m0   
4                 u0                 u2       m0   
5                 u0                 u2       m0   
6                 u0                 u2       m0   
7                 u0                 u2       m0   
8                 u0                 u2       m0   
9                 u0                 u2       m0   

                            line_order_conv  
0          ['L194', 'L195', 'L196', 'L197']  
1                          ['L198', 'L199']  
2          ['L200', 'L201', 'L202', 'L203']  
3                  ['L204', 'L205', 'L206']  
4                          ['L207', 'L208']  
5  ['L271', 'L272', 'L273', 'L274', 'L275']  
6                          ['L276', 'L277']  
7                          ['L280', 'L281']  
8            

In [4]:
import pandas as pd

# Define the input file and output CSV file names
input_file = 'archive/movie_lines.txt'
output_file = 'clean_dataset/movie_lines.csv'

# Define the column names
columns = ['line_id', 'character_id', 'movie_id', 'character_name', 'line_text']

# Read the text file and split it based on the provided separator
data = []
with open(input_file, 'r', encoding='utf-8', errors='ignore') as file:
    for line in file:
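        # Split each line using the separator ' +++$+++ '.
        # rstrip('\n') keeps a trailing separator intact, so a row with an
        # empty utterance still yields five fields.
        split_line = line.rstrip('\n').split(' +++$+++ ')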
        data.append(split_line)

# Create a pandas DataFrame
df = pd.DataFrame(data, columns=columns)

# Display the first 10 rows of the dataframe
print(df.head(10))

# Save the dataframe to a CSV file
df.to_csv(output_file, index=False)

print(f"CSV file saved as {output_file}")


  line_id character_id movie_id character_name  \
0   L1045           u0       m0         BIANCA   
1   L1044           u2       m0        CAMERON   
2    L985           u0       m0         BIANCA   
3    L984           u2       m0        CAMERON   
4    L925           u0       m0         BIANCA   
5    L924           u2       m0        CAMERON   
6    L872           u0       m0         BIANCA   
7    L871           u2       m0        CAMERON   
8    L870           u0       m0         BIANCA   
9    L869           u0       m0         BIANCA   

                                           line_text  
0                                       They do not!  
1                                        They do to!  
2                                         I hope so.  
3                                          She okay?  
4                                          Let's go.  
5                                                Wow  
6     Okay -- you're gonna need to learn how to lie.  
7        

In [5]:
import pandas as pd

# Define the input file and output CSV file names
input_file = 'archive/movie_titles_metadata.txt'
output_file = 'clean_dataset/movie_titles_metadata.csv'

# Define the column names
columns = ['movie_id', 'movie_title', 'movie_year', 'IMDB_rating', 'IMDB_votes', 'genres']

# Read the text file and split it based on the provided separator
data = []
with open(input_file, 'r', encoding='utf-8', errors='ignore') as file:
    for line in file:
        # Split each line using the separator ' +++$+++ '
        split_line = line.strip().split(' +++$+++ ')
        data.append(split_line)

# Create a pandas DataFrame
df = pd.DataFrame(data, columns=columns)

# Display the first 10 rows of the dataframe
print(df.head(10))

# Save the dataframe to a CSV file
df.to_csv(output_file, index=False)

print(f"CSV file saved as {output_file}")


  movie_id                                    movie_title movie_year  \
0       m0                     10 things i hate about you       1999   
1       m1                     1492: conquest of paradise       1992   
2       m2                                     15 minutes       2001   
3       m3                          2001: a space odyssey       1968   
4       m4                                        48 hrs.       1982   
5       m5                              the fifth element       1997   
6       m6                                            8mm       1999   
7       m7  a nightmare on elm street 4: the dream master       1988   
8       m8     a nightmare on elm street: the dream child       1989   
9       m9                           the atomic submarine       1959   

  IMDB_rating IMDB_votes                                             genres  
0        6.90      62847                              ['comedy', 'romance']  
1        6.20      10421     ['adventure', 'biograp

Merging the datasets into the final results:

1. movie_all_data.csv, which will contain all movie information
2. movie_conversations.csv, which will not be changed, as it holds the conversation sequences.


In [2]:
import pandas as pd

# Load the CSV files
movie_titles_metadata = pd.read_csv('clean_dataset/movie_titles_metadata.csv')
movie_characters_metadata = pd.read_csv('clean_dataset/movie_characters_metadata.csv')
movie_lines = pd.read_csv('clean_dataset/movie_lines.csv')

# Phase 1: Merge movie_characters_metadata with movie_titles_metadata on movie_id
merge_phase1 = pd.merge(movie_characters_metadata, 
                        movie_titles_metadata[['movie_id', 'movie_year', 'IMDB_rating', 'IMDB_votes', 'genres']], 
                        on='movie_id', 
                        how='left')

# Phase 2: Merge the result from phase 1 with movie_lines on character_id
movie_all_data = pd.merge(merge_phase1, 
                          movie_lines[['character_id', 'line_id', 'line_text']], 
                          on='character_id', 
                          how='left')

# Save the final merged data to a CSV file
output_file = 'clean_dataset/movie_all_data.csv'
movie_all_data.to_csv(output_file, index=False)
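
Left joins can silently multiply or drop information, so it can be worth checking the two merge phases before relying on the result. The sketch below repeats the two-phase merge on toy tables and adds checks through the `validate` and `indicator` options of `pd.merge`. The toy values are borrowed from the previews in this section, but the subset is artificial, so which characters end up without lines here says nothing about the real data.

```python
import pandas as pd

# Toy stand-ins for the three cleaned tables.
titles = pd.DataFrame({
    "movie_id": ["m0"],
    "movie_year": ["1999"],
    "genres": ["['comedy', 'romance']"],
})
characters = pd.DataFrame({
    "character_id": ["u0", "u1", "u2"],
    "character_name": ["BIANCA", "BRUCE", "CAMERON"],
    "movie_id": ["m0", "m0", "m0"],
})
lines = pd.DataFrame({
    "character_id": ["u0", "u2", "u0"],
    "line_id": ["L1045", "L1044", "L985"],
    "line_text": ["They do not!", "They do to!", "I hope so."],
})

# Phase 1: many characters map to one movie; validate raises if movie_id repeats in titles.
phase1 = pd.merge(characters, titles, on="movie_id", how="left", validate="many_to_one")
assert len(phase1) == len(characters)

# Phase 2: one character maps to many lines; indicator marks characters without lines.
merged = pd.merge(phase1, lines, on="character_id", how="left",
                  validate="one_to_many", indicator=True)

# Characters without any line survive the left join with empty line columns.
no_lines = merged.loc[merged["_merge"] == "left_only", "character_name"].tolist()
print(len(merged), no_lines)
```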