# Load IMDb dataset

The IMDb Top 1000 dataset contains the following columns:

1. `Poster_Link`: Link to the movie poster.
2. `Series_Title`: Title of the movie.
3. `Released_Year`: Year of release.
4. `Certificate`: Movie certification.
5. `Runtime`: Duration of the movie.
6. `Genre`: Genre(s) of the movie.
7. `IMDB_Rating`: IMDb rating of the movie.
8. `Overview`: Brief summary of the movie.
9. `Meta_score`: Metacritic score of the movie.
10. `Director`: Director of the movie.
11. `Star1` to `Star4`: Leading stars of the movie.
12. `No_of_Votes`: Number of votes received on IMDb.
13. `Gross`: Gross revenue.







In [42]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import seaborn as sns
from sklearn.model_selection import KFold

# Reading the uploaded file
file_path = 'imdb_top_1000.csv'
imdb_df = pd.read_csv(file_path)
imdb_df.head()


Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [38]:
# Checking data types and missing values
data_types = imdb_df.dtypes
missing_values = imdb_df.isna().sum()

data_types, missing_values


(Poster_Link       object
 Series_Title      object
 Released_Year     object
 Certificate       object
 Runtime           object
 Genre             object
 IMDB_Rating      float64
 Overview          object
 Meta_score       float64
 Director          object
 Star1             object
 Star2             object
 Star3             object
 Star4             object
 No_of_Votes        int64
 Gross             object
 dtype: object,
 Poster_Link        0
 Series_Title       0
 Released_Year      0
 Certificate      101
 Runtime            0
 Genre              0
 IMDB_Rating        0
 Overview           0
 Meta_score       157
 Director           0
 Star1              0
 Star2              0
 Star3              0
 Star4              0
 No_of_Votes        0
 Gross            169
 dtype: int64)

The IMDb Top 1000 dataset consists of various data types, with a mix of numerical, categorical, and textual data. Here's a summary of the data types and missing values:

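- `Poster_Link`, `Series_Title`, `Released_Year`, `Certificate`, `Runtime`, `Genre`, `Overview`, `Director`, `Star1`, `Star2`, `Star3`, `Star4`, and `Gross` are of type `object`, meaning pandas stores them as strings. `Released_Year`, `Runtime`, and `Gross` carry numeric information despite the `object` type, so they need conversion before they can serve as numeric features.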
- `IMDB_Rating` and `Meta_score` are `float64`, representing numerical data.
- `No_of_Votes` is an `int64`, also numerical.
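
An `object` column that should be numeric usually contains at least one entry that does not parse as a number. A minimal sketch of how such a column could be inspected and coerced, using a small made-up series rather than the real data (the values, including the stray `'PG'`, are illustrative assumptions):

```python
import pandas as pd

# Hypothetical stand-in for an object-typed column that should be numeric.
years = pd.Series(['1994', '1972', 'PG', '2008'], dtype='object')

# errors='coerce' turns entries that cannot be parsed into NaN instead of raising.
years_numeric = pd.to_numeric(years, errors='coerce')

# Entries that were present but failed to parse point to dirty values.
bad_entries = years[years_numeric.isna() & years.notna()]
print(bad_entries.tolist())
print(years_numeric.dtype)
```

Listing the offending entries before coercing makes it easier to decide whether to drop, correct, or impute them.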

Regarding missing values:
- `Certificate` has 101 missing values.
- `Meta_score` has 157 missing values.
- `Gross` has 169 missing values.

For the LazyFCA approach, we need to identify a target variable and preprocess the features accordingly. Potential target variables could be the `IMDB_Rating` or `Meta_score`. However, these are continuous variables, and LazyFCA typically works with categorical targets. We could convert them into categories based on certain thresholds.
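
One way to turn a continuous score into categories is `pandas.cut`. The sketch below is only an illustration on made-up ratings; the cut points 6.0 and 8.0 and the label names are assumptions chosen for the example:

```python
import pandas as pd

# Illustrative ratings, not taken from the dataset.
ratings = pd.Series([5.4, 6.0, 7.9, 8.0, 9.3])

# right=False makes each bin closed on the left: [0, 6), [6, 8), [8, 10.1).
# The thresholds 6.0 and 8.0 are example cut points, chosen here as an assumption.
categories = pd.cut(
    ratings,
    bins=[0.0, 6.0, 8.0, 10.1],
    labels=['Low', 'Medium', 'High'],
    right=False,
)
print(categories.tolist())
```

Whatever thresholds are used, it is worth checking that each resulting category actually occurs in the data, since a "top" list may not contain low ratings at all.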

In [39]:
from sklearn.preprocessing import LabelEncoder

# Helper function to convert runtime to numeric
def convert_runtime(runtime):
    return int(runtime.split()[0])

# Helper function to categorize IMDB ratings
def categorize_rating(rating):
    if rating < 6.0:
        return 'Low'
    elif rating < 8.0:
        return 'Medium'
    return 'High'

# Preprocessing
# Convert 'Runtime' to numeric
imdb_df['Runtime'] = imdb_df['Runtime'].apply(convert_runtime)

# Categorize 'IMDB_Rating'
imdb_df['IMDB_Rating_Category'] = imdb_df['IMDB_Rating'].apply(categorize_rating)

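# Fill missing values (assign the result back: fillna(inplace=True) on a selected
# column is deprecated in recent pandas and may not update the DataFrame)
imdb_df['Certificate'] = imdb_df['Certificate'].fillna('Unknown')
imdb_df['Meta_score'] = imdb_df['Meta_score'].fillna(imdb_df['Meta_score'].median())
imdb_df['Gross'] = imdb_df['Gross'].fillna('0')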

# Convert 'Gross' to numeric by removing non-numeric characters
imdb_df['Gross'] = imdb_df['Gross'].str.replace(r'[^\d.]', '', regex=True).astype(float)

# Label encoding for categorical variables
label_cols = ['Certificate', 'Genre', 'Director', 'Star1', 'Star2', 'Star3', 'Star4']
label_encoders = {col: LabelEncoder().fit(imdb_df[col]) for col in label_cols}
for col, le in label_encoders.items():
    imdb_df[col+'_encoded'] = le.transform(imdb_df[col])

# Selecting relevant columns for LazyFCA
relevant_cols = ['Released_Year', 'Runtime', 'Meta_score', 'No_of_Votes', 'Gross'] + [col+'_encoded' for col in label_cols]
target_col = 'IMDB_Rating_Category'

# Final dataset for analysis
imdb_fca_df = imdb_df[relevant_cols + [target_col]]
imdb_fca_df.head()


Unnamed: 0,Released_Year,Runtime,Meta_score,No_of_Votes,Gross,Certificate_encoded,Genre_encoded,Director_encoded,Star1_encoded,Star2_encoded,Star3_encoded,Star4_encoded,IMDB_Rating_Category
0,1994,142,80.0,2343110,28341469.0,1,137,141,599,568,89,912,High
1,1972,175,100.0,1620367,134966411.0,1,122,137,417,9,336,194,High
2,2008,152,84.0,2303232,534858444.0,14,22,83,128,283,1,620,High
3,1974,202,90.0,1129952,57300000.0,1,122,137,9,657,704,194,High
4,1957,96,96.0,689845,4360000.0,12,122,456,252,464,535,421,High




The preprocessing cell applies the following steps:

1. `Runtime` has been converted to a numeric value (in minutes).
2. `IMDB_Rating` has been categorized as 'Low' (below 6.0), 'Medium' (6.0 up to but not including 8.0), or 'High' (8.0 and above).
3. Missing values in `Certificate`, `Meta_score`, and `Gross` have been handled:
   - `Certificate` missing values are filled with 'Unknown'.
   - `Meta_score` missing values are filled with the median value.
   - `Gross` missing values are set to 0, which makes them indistinguishable from a true gross of zero.
4. `Gross` has been converted to a numeric value.
5. Label encoding has been applied to `Certificate`, `Genre`, `Director`, `Star1` to `Star4`.
6. The final dataset includes these features: 'Released_Year', 'Runtime', 'Meta_score', 'No_of_Votes', 'Gross', and the encoded categorical variables. The target variable is `IMDB_Rating_Category`.
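
Label encoding deserves a brief illustration, because the integer codes it produces follow the sorted order of the labels and carry no ordinal meaning. A minimal sketch on made-up genre strings (the values are assumptions for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up genre strings for illustration.
genres = ['Drama', 'Crime, Drama', 'Action, Crime, Drama', 'Drama']

encoder = LabelEncoder().fit(genres)
codes = encoder.transform(genres)

# classes_ is sorted, so the integer code is just the position in that sorted list.
print(list(encoder.classes_))
print(codes.tolist())

# The mapping is reversible, which helps when reading encoded output.
print(encoder.inverse_transform(codes).tolist())
```

Because the distance between two codes is arbitrary, treating encoded columns as intervals groups labels that merely sit next to each other alphabetically.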



In [40]:
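def LazyFCA(X, y, cod, cv=5, min_supp=1, ranged=False, gap=None):
  """Performs lazy FCA classification and computes the cross-validation score.

  Parameters
  ----------
  X : pandas.DataFrame
      Data features (accessed positionally via .iloc)
  y : list or array-like
      Target feature
  cod : list of str
      Type of coding for each feature
      - 'c': categorical
      - 'i': interval intersection
  cv : int
      Number of folds in k-fold CV
  min_supp : int
      Minimal support of a hypothesis
  ranged : bool
      Whether the classes are ordered (requires a numeric y)
  gap : int
      Maximum length of the classification interval (must be set when ranged=True)

  Returns
  -------
  prediction : list
      For each object, the list of supporting class labels, or None if unclassified
  acc : list of float
      Accuracy on each CV fold
  mean_acc : float
      Mean accuracy across folds
  unclass : float
      Fraction of objects left unclassified
  """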
  y = np.array(y)
  kf = KFold(n_splits=cv, random_state=None, shuffle=False)
  prediction = [None] * len(y)
  if ranged:
    acc=[]
    for train_index, test_index in kf.split(X):
      for test in test_index:#outer loop through test examples
        for tr in train_index:#first inner loop through hypotheses
          hyp = [None] * len(cod)
          for i in range(len(cod)):  # build the hypothesis from the test and training objects
            if cod[i] == 'i':
              hyp[i] = [min(X.iloc[test, i], X.iloc[tr, i]), max(X.iloc[test, i], X.iloc[tr, i])]
            elif cod[i] == 'c':
              hyp[i] = X.iloc[test, i] == X.iloc[tr, i]
          pred_int = [y[tr]]
          for htr in train_index:  # second inner loop: check the hypothesis against the training set
            for i in range(len(cod)):  # check a single example, feature by feature
              if (cod[i] == 'i' and not (hyp[i][0] <= X.iloc[htr, i] <= hyp[i][1])) or (cod[i] == 'c' and hyp[i] and X.iloc[htr, i] != X.iloc[test, i]):
                break
              elif i == len(cod)-1 and htr != tr:
                pred_int.append(y[htr])
            if (max(pred_int)-min(pred_int)+1) > gap:
              break
            elif htr == train_index[-1] and len(pred_int) >= min_supp:
              prediction[test] = pred_int
          if prediction[test] is not None:
            break
      right = 0
      for p in test_index:
        if prediction[p] is not None and min(prediction[p]) <= y[p] <= max(prediction[p]):
          right += 1
      acc.append(right/len(test_index))

  else:
    acc=[]
    for train_index, test_index in kf.split(X):
      for test in test_index:#outer loop through test examples
        for tr in train_index:#first inner loop through hypotheses
          hyp = [None] * len(cod)
          for i in range(len(cod)):  # build the hypothesis from the test and training objects
            if cod[i] == 'i':
              hyp[i] = [min(X.iloc[test, i], X.iloc[tr, i]), max(X.iloc[test, i], X.iloc[tr, i])]
            elif cod[i] == 'c':
              hyp[i] = X.iloc[test, i] == X.iloc[tr, i]
          pred_int = [y[tr]]
          for htr in train_index:  # second inner loop: check the hypothesis against the training set
            for i in range(len(cod)):  # check a single example, feature by feature
              if (cod[i] == 'i' and not (hyp[i][0] <= X.iloc[htr, i] <= hyp[i][1])) or (cod[i] == 'c' and hyp[i] and X.iloc[htr, i] != X.iloc[test, i]):
                break
              elif i == len(cod)-1 and htr != tr:
                pred_int.append(y[htr])
            if pred_int[-1] != pred_int[0]:
              break
            elif htr == train_index[-1] and len(pred_int) >= min_supp:
              prediction[test] = pred_int
          if prediction[test] is not None:
            break
      right = 0
      for p in test_index:
        if prediction[p] is not None and y[p] == prediction[p][0]:
          right += 1
      acc.append(right/len(test_index))

  # Fraction of objects for which no hypothesis produced a prediction
  unclass = sum(p is None for p in prediction) / len(y)


  return prediction, acc, np.mean(acc), unclass
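
To make the interval-intersection step concrete, the following sketch builds one hypothesis from a test object and a training object and checks which training objects fall inside it. It is a simplified illustration on made-up numbers; the helper names `build_hypothesis` and `covers` are hypothetical and not part of the function above.

```python
import numpy as np

def build_hypothesis(test_obj, train_obj):
    """Smallest per-feature interval that contains both objects."""
    lower = np.minimum(test_obj, train_obj)
    upper = np.maximum(test_obj, train_obj)
    return lower, upper

def covers(hypothesis, obj):
    """True if every feature of obj lies inside the hypothesis intervals."""
    lower, upper = hypothesis
    return bool(np.all((lower <= obj) & (obj <= upper)))

# Toy training data: two features per object, with class labels.
X_train = np.array([[1.0, 10.0], [2.0, 12.0], [8.0, 30.0]])
y_train = np.array(['High', 'High', 'Medium'])
test_obj = np.array([1.5, 11.0])

# Hypothesis generated by the test object and the first training object.
hyp = build_hypothesis(test_obj, X_train[0])
covered_labels = [str(y_train[j]) for j in range(len(X_train)) if covers(hyp, X_train[j])]

# The hypothesis supports a class only if all covered training objects share it.
is_consistent = len(set(covered_labels)) == 1
print(hyp, covered_labels, is_consistent)
```

In this toy case only the generating training object lies inside the intervals, so the hypothesis is consistent with a support of one.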

In [41]:
# Coding type for each feature ('i' for interval, 'c' for categorical)
feature_coding = ['i'] * len(relevant_cols)  # All features treated as interval type for simplicity; the label-encoded columns have no natural order, so 'c' may suit them better

# Applying LazyFCA
# Splitting features (X) and target (y)
X_imdb = imdb_fca_df[relevant_cols]
y_imdb = imdb_fca_df[target_col]

# Applying LazyFCA - we'll start with a small subset due to computational intensity
toy_x_imdb = X_imdb.iloc[:50]
toy_y_imdb = y_imdb.iloc[:50]

# Applying the LazyFCA function to the toy dataset
pred_imdb, acc_imdb, mean_acc_imdb, uc_imdb = LazyFCA(toy_x_imdb, toy_y_imdb, feature_coding, cv=5)
pred_imdb, acc_imdb, mean_acc_imdb, uc_imdb


([['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High'],
  ['High']],
 [1.0, 1.0, 1.0, 1.0, 1.0],
 1.0,
 0.0)

The `LazyFCA` function was applied to the first 50 rows of the IMDb dataset. The results read as follows:

1. **Predictions (`pred_imdb`)**: Every prediction is `['High']`, so the algorithm assigns the 'High' rating category to each instance in the subset.

2. **Accuracy (`acc_imdb`)**: Each fold in the 5-fold cross-validation has an accuracy of 1.0 (100%). Together with the uniform predictions, this means every true label in the subset is also 'High'. The rows appear to be sorted by rating in descending order, so the first 50 rows contain a single class and the perfect score is trivial rather than evidence of predictive power.

3. **Mean Accuracy (`mean_acc_imdb`)**: The mean accuracy across all folds is therefore also 1.0.

4. **Unclassified (`uc_imdb`)**: The value is 0.0, meaning that no instance was left unclassified.
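
A quick way to judge whether an accuracy figure is informative is to look at the class distribution of the evaluated subset. The sketch below uses a made-up label series ordered like a ranking (best first) to show how taking the first rows can yield a single class, whereas a random sample draws from the whole ranking. The data and the `random_state` value are assumptions for illustration.

```python
import pandas as pd

# Made-up target sorted by rating, mimicking a "top N" listing.
labels = pd.Series(['High'] * 60 + ['Medium'] * 140)

# Taking the head of a sorted list only sees the best-rated rows.
head_subset = labels.iloc[:50]
print(head_subset.value_counts().to_dict())

# A random sample draws from the whole ranking instead of only the top.
random_subset = labels.sample(n=50, random_state=0)
print(random_subset.value_counts().to_dict())

# A majority-class baseline gives context for any accuracy figure.
baseline_accuracy = head_subset.value_counts(normalize=True).max()
print(baseline_accuracy)
```

When the majority-class baseline already equals the reported accuracy, the evaluation says little about the classifier itself.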

