1. Create a machine learning model that can predict the genre of a movie based on its plot summary or other textual information. You can use techniques like TF-IDF or word embeddings with classifiers such as Naive Bayes, Logistic Regression, or Support Vector Machines.
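As a minimal sketch of the approach described above, the following combines TF-IDF features with a Naive Bayes classifier in a scikit-learn `Pipeline`. The tiny in-memory corpus and the `build_genre_classifier` helper are assumptions made up for illustration; they are not part of the dataset used in the cells below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_genre_classifier():
    """Return an untrained TF-IDF + Multinomial Naive Bayes pipeline (hypothetical helper)."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("nb", MultinomialNB()),
    ])


# Hypothetical toy corpus, invented for illustration only
toy_plots = [
    "A detective hunts a killer through the dark city streets",
    "A killer stalks a detective in a dark abandoned house",
    "Two friends go on a funny road trip full of jokes",
    "A funny wedding goes wrong with jokes and laughs",
]
toy_genres = ["thriller", "thriller", "comedy", "comedy"]

model = build_genre_classifier()
model.fit(toy_plots, toy_genres)

# Predict the genre of an unseen plot summary
prediction = model.predict(["A dark story about a killer and a detective"])
print(prediction[0])
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on training text only, and swapping `MultinomialNB` for Logistic Regression or an SVM changes a single line.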

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load dataset from text file
with open('train_data.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Extract features (plot summaries) and labels (genres) from the text file.
# Each line has the form "ID ::: TITLE ::: GENRE ::: DESCRIPTION", so split on
# ':::' rather than on a tab (a tab split yields one element and skips every line).
features = []
labels = []
skipped = 0
for line in lines:
    parts = [part.strip() for part in line.split(':::', 3)]
    if len(parts) == 4:  # ID, title, genre, description
        features.append(parts[3])  # Plot summary is the fourth field
        labels.append(parts[2])    # Genre is the third field
    else:
        skipped += 1  # Count instead of printing each line, which floods the notebook output

print("Skipped lines with insufficient elements:", skipped)

# Check if there are any samples available after filtering
if len(features) == 0:
    print("No samples available after filtering. Please check your dataset.")
else:
    # Convert features and labels into a DataFrame
    data = pd.DataFrame({'plot': features, 'genre': labels})

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(data['plot'], data['genre'], test_size=0.2, random_state=42)

    # Initialize TF-IDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer(max_features=5000)

    # Fit and transform the training set
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

    # Transform the test set
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    # Initialize Support Vector Machine Classifier
    svm_classifier = SVC(kernel='linear')

    # Train the classifier
    svm_classifier.fit(X_train_tfidf, y_train)

    # Predictions
    y_pred = svm_classifier.predict(X_test_tfidf)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)
    print("Classification Report:")
    print(classification_report(y_test, y_pred))


Streaming output truncated to the last 5000 lines.
Skipping line with insufficient elements: 1443 ::: In Jackson Heights (2015) ::: documentary ::: Jackson Heights, Queens, New York City is one of the most ethnically and culturally diverse communities in the United States and the world. There are immigrants from every country in South America, Mexico, Bangladesh, Pakistan, Afghanistan, India and China. Some are citizens, some have green cards, some are without documents. The people who live in Jackson Heights, in their cultural, racial and ethnic diversity, are representative of the new wave of immigrants to America. 167 languages are spoken in Jackson Heights. Some of the issues the film raises-assimilation, integration, immigration and cultural and religious differences-are common to all the major cities of the Western world. The subject of the film is the daily life of the people in this community-their businesses, community centers, religions, and political, cultural 

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Skipping line with insufficient elements: 14273 ::: Mein Vater der Wald (2011) ::: documentary ::: Jerzy Jurek, born 1929, worked for about 40 years in the polish forester's house. Melioration and planing new forest paths, were the main fields of his work. That for he had has to walk almost 20 km a day, for 6 months a year, for 40 years. "My Father the Forst" is a portrait of Jerzy Jurek, his work and his childhood during the Second World War. He experienced forced labor, the time which punch his whole life.

Skipping line with insufficient elements: 14274 ::: Con Fidel, pase lo que pase (2012) ::: documentary ::: Sierra Maestra, Cuba, 850 km east of Havana. The day before the celebration of the 52nd anniversary of the Revolution. An old man is repairing a few decades old motorcycle. A young dentist is trying to find some transport to a clinic in remote mountains. A middle-aged married couple run a public telephone booth 

22702 ::: A Nymph of the Waves (1900) ::: short ::: A woman in ballet slippers wearing a large white hat and a long white dress - with ruffles, puffy sleeves and petticoats - dances across water with roiling waves behind her. She holds the edges of the skirt with her hands, lifting and twirling, sometimes exposing her bloomers and a dark garter on one leg. Her style combines ballet with the exuberant kicks and twirls of a burlesque dance hall. With churning waves behind her, the water seems to wash beneath her feet. The film of the dancer, "M'lle. Cathrina Bartho" (1899), is superimposed on that of the water, "Upper Rapids, from Bridge" (1896).

Skipping line with insufficient elements: 22703 ::: Bondage Surprise (2010) ::: adventure ::: Six vignettes of woman captured and placed in bondage. In the first story, four women, ready themselves for a party when a home invader binds and gags and leaves them squirming in the bedroom. Next a naked blonde finds herself tied and gagged outside o

 31307 ::: Afscheid van de Maan (2014) ::: drama ::: It's the hot summer of 1972. On the 9th floor of a tower block on the outskirts of a Dutch provincial town the sixties finally kicked in. Change is in the air and actually materializes the moment a new resident, an artist named Loes, moves in. Soon everything will change for the family of the 12-year old Duch. To his dad, Bob, the new neighbor symbolizes all his doubts about his plotless existence. She is both adventurous and eccentric and almost without a second thought he decides to move in with Loes and her daughter, two apartments away from his old family. The children don't fully realize the drama unfolding. While their parents try to rediscover themselves, or try to preserve what was once theirs, the children focus on the future outside. Maybe truly perceiving things as they are. Duch, the son of the family, has two big passions in life. The manned mission to the moon and the dreamy, beautiful, Valium addicted neighbor 'aunt' M

 48377 ::: It Happened in Beijing (????) ::: comedy ::: On a bet, a Gossip Journalist looking for a regular guy, trades interview assignments with a Science Journalist - only to discover the biggest Gossip story of her career, a gloriously charismatic doctor visiting Beijing who, through a link of royal scandals, has been reluctantly bumped next in line to be King. On a bet, a Gossip Journalist trades interview assignments with a Science Journalist and falls in love with her subject, a noble International Surgeon visiting Beijing - only to find herself in the greatest gossip story of her career when a scandal bumps him next in line to be King and he doesn't want it.

Skipping line with insufficient elements: 48378 ::: Space Pirates (2014) ::: sci-fi ::: In a galaxy, a group of aliens are sent on a secret mission to save their planet from a black hole, upon arriving at their destination, an asteroid collides with their vessel causing the crew to crash land on Earth. The aliens live unde

Skipping line with insufficient elements: 51687 ::: Shadows and Fog (1991) ::: comedy ::: A small and insignificant bookkeeper, Kleinman, is awoken one night by his neighbors who wants his help to track down a strangler who has been killing people all over town. The citizens form vigilance committees, but when Kleinman has dressed, his neighbors have disappeared. Meanwhile a circus has come to town. Irmy and Paul are two of the artists. After a fight, Irmy leaves the circus in the middle of the night. Eventually she meets Kleinman, scared and alone. It's the middle of a dark and foggy night. Max Kleinman is awaken from a deep sleep by a group of men wanting him to join their lynch mob to capture a serial murderer, who only seems to strike when foggy and who primarily murders by strangulation with a wire cord. Kleinman is included within the plan even though he is largely treated as a schmuck by those that know him, the members of the mob who don't even tell him what his role in the pla
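
The records echoed in the output above follow an `ID ::: TITLE ::: GENRE ::: DESCRIPTION` layout. A small sketch of parsing one such record is shown here; `parse_record` is a hypothetical helper, and the sample description is shortened from the output above.

```python
def parse_record(line):
    """Split one 'ID ::: TITLE ::: GENRE ::: DESCRIPTION' line into a dict.

    Returns None when the line does not contain all four fields.
    """
    # maxsplit=3 keeps any later ':::' inside the description intact
    parts = [part.strip() for part in line.split(":::", 3)]
    if len(parts) != 4:
        return None
    record_id, title, genre, description = parts
    return {"id": record_id, "title": title, "genre": genre, "description": description}


# Sample record with a shortened description
sample = "22702 ::: A Nymph of the Waves (1900) ::: short ::: A woman in ballet slippers dances across water."
record = parse_record(sample)
print(record["genre"])

# A line without the separator is rejected rather than partially parsed
malformed = parse_record("no separators here")
print(malformed)
```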

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

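# Load train and test data.
# The separator regex also consumes the spaces around ':::' so that genre labels
# and titles are not stored with leading or trailing whitespace.
train_data = pd.read_csv("train_data.txt", sep=r"\s*:::\s*", names=["ID", "TITLE", "GENRE", "DESCRIPTION"], engine='python')
test_data = pd.read_csv("test_data.txt", sep=r"\s*:::\s*", names=["ID", "TITLE", "DESCRIPTION"], engine='python')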

# Vectorize text data
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
X_train = vectorizer.fit_transform(train_data["DESCRIPTION"])
X_test = vectorizer.transform(test_data["DESCRIPTION"])

# Train logistic regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, train_data["GENRE"])

# Predict genre for test data
predictions = log_reg.predict(X_test)

# Output predictions
output_df = pd.DataFrame({"ID": test_data["ID"], "GENRE": predictions})
output_df.to_csv("predictions.csv", index=False)
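
The cell above writes predictions for unlabeled test data, so it does not report how well the model performs. One possible way to estimate that, sketched here under assumptions, is to hold out part of the labeled training data for validation. The `evaluate_on_holdout` helper and the toy texts are hypothetical and stand in for `train_data["DESCRIPTION"]` and `train_data["GENRE"]`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_on_holdout(texts, genres, test_size=0.25, random_state=42):
    """Train on one part of the labeled data and return accuracy on the rest."""
    X_fit, X_val, y_fit, y_val = train_test_split(
        texts, genres, test_size=test_size, random_state=random_state, stratify=genres
    )
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit the vocabulary on the training part only, so validation text does not leak in
    X_fit_tfidf = vectorizer.fit_transform(X_fit)
    X_val_tfidf = vectorizer.transform(X_val)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_fit_tfidf, y_fit)
    return accuracy_score(y_val, model.predict(X_val_tfidf))


# Hypothetical toy data, invented for illustration only
toy_texts = [
    "A detective hunts a killer in the dark city",
    "A killer hides from the detective at night",
    "The detective finds the killer in a dark alley",
    "A dark night, a killer, and a tired detective",
    "Friends share jokes on a funny road trip",
    "A funny wedding full of jokes and laughs",
    "Two funny friends tell jokes at a party",
    "Jokes and laughs fill a funny family dinner",
]
toy_labels = ["thriller"] * 4 + ["comedy"] * 4

val_accuracy = evaluate_on_holdout(toy_texts, toy_labels)
print("Validation accuracy:", val_accuracy)
```

Stratifying the split keeps the genre proportions similar in both parts, which matters when some genres are much rarer than others.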
