# Movie Log Data Processing

This notebook parses a movie request log and splits `GET /data/` and `GET /rate/` requests into two separate CSV files. The input is read in chunks, so the whole file never has to be loaded into memory at once.

## Imports and Configuration


In [None]:
import pandas as pd

# Configuration
INPUT_FILE = "../data/movie_logs.csv"
OUTPUT_FILE_DATA = "../data/processed_data_entries.csv"
OUTPUT_FILE_RATE = "../data/processed_rate_entries.csv"
CHUNKSIZE = 1000000  # Rows read per chunk; lower this for quick tests
# TEST_ROWS = 10000  # Number of rows to test


## Data Processing Function

This function processes a chunk of data, separating it into 'Data' and 'Rate' entries.

- **Input**: A chunk of the CSV file (pandas DataFrame)
- **Output**: Two DataFrames (data_entries, rate_entries)

### Key Operations:
1. Iterates through each row in the chunk
2. Identifies 'Data' (/data/) and 'Rate' (/rate/) requests
3. Extracts relevant information:
   - For 'Data': timestamp, user_id (read from the `request_id` column), movie_title, watched_minutes
   - For 'Rate': timestamp, user_id (read from the `request_id` column), movie_title, rating
4. Cleans movie titles (replaces '+' with spaces)
5. Catches any exception raised while parsing a row, prints it, and moves on to the next row

### Note:
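- Skips incomplete entries: '/data/' requests with fewer than three path segments and '/rate/' requests without exactly one '='
- Ignores rows whose request starts with neither `GET /data/` nor `GET /rate/`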
- Returns separate DataFrames for 'Data' and 'Rate' entries
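
The parsing steps above can be tried on single request strings. This is a minimal sketch: the helper name `parse_request` and the sample requests (including the `m` path segment and the `.mpg` suffix) are assumptions for illustration, and the real log format may differ.

```python
def parse_request(req):
    """Return (kind, fields) for a recognised request, or None if it is skipped."""
    if req.startswith("GET /data/"):
        parts = req.split("/data/")[1].split("/")
        if len(parts) < 3:
            return None  # incomplete '/data/' entry
        return "data", {
            "movie_title": parts[1].replace("+", " "),
            "watched_minutes": parts[2].split(".")[0],
        }
    if req.startswith("GET /rate/"):
        movie_rating = req.split("/rate/")[1].split("=")
        if len(movie_rating) != 2:
            return None  # not exactly one '='
        return "rate", {
            "movie_title": movie_rating[0].replace("+", " "),
            "rating": movie_rating[1],
        }
    return None  # neither a '/data/' nor a '/rate/' request


# Hypothetical sample requests (assumed format)
print(parse_request("GET /data/m/star+wars+1977/6.mpg"))
print(parse_request("GET /rate/tangled+2010=4"))
print(parse_request("GET /data/m/tangled+2010"))  # too few segments
```

Note that the extracted minutes and rating stay as strings; converting them to numbers would be a separate step.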


In [14]:
def process_chunk(chunk):
    """Process a chunk of data and return transformed DataFrames for Data and Rate entries"""
    data_entries = []
    rate_entries = []
    
    for _, row in chunk.iterrows():
        try:
            req = row["request"]
            
            if req.startswith("GET /data/"):
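                # Process Data entry: the path after '/data/' must have at
                # least three '/'-separated segments; the first one is unused
                parts = req.split("/data/")[1].split("/")
                if len(parts) < 3:
                    continue  # Incomplete entry: skip
                
                # Title uses '+' for spaces; minutes drop anything after the first '.'
                movie_title = parts[1].replace("+", " ")
                minutes = parts[2].split(".")[0]
                
                data_entries.append({
                    "timestamp": row["timestamp"],
                    "user_id": row["request_id"],  # request_id column is written out as user_id
                    "movie_title": movie_title,
                    "watched_minutes": minutes
                })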
                
            elif req.startswith("GET /rate/"):
                # Process Rate entry: expected form after '/rate/' is '<title>=<rating>'
                rate_part = req.split("/rate/")[1]
                movie_rating = rate_part.split("=")
                if len(movie_rating) != 2:
                    continue  # Not exactly one '=': skip
                
                movie_title = movie_rating[0].replace("+", " ")
                
                rate_entries.append({
                    "timestamp": row["timestamp"],
                    "user_id": row["request_id"],  # request_id column is written out as user_id
                    "movie_title": movie_title,
                    "rating": movie_rating[1]
                })
                
        except Exception as e:
            print(f"Error processing row: {e}")
            continue
    
    return pd.DataFrame(data_entries), pd.DataFrame(rate_entries)


## Data Processing and File Writing

- Initializes separate output files for 'Data' and 'Rate' entries with headers
- Processes input file in chunks:
  - Uses `process_chunk` function
  - Appends results to respective output files
  - Tracks progress
- Prints total processed entries
- Previews first 5 rows of each output file

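Note: Only one chunk of `CHUNKSIZE` rows is parsed at a time, and its results are appended to disk before the next chunk is read, so memory use stays bounded regardless of the input file size.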
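
The read-and-append pattern described above can be sketched on a tiny in-memory CSV, so it runs without the real log file. The sample rows, the chunk size of 2, and the temporary output path are assumptions for illustration only; the filter here uses a simple `str.startswith` rather than the full `process_chunk` logic.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical three-column log in the same layout as the input file
sample_csv = io.StringIO(
    "2025-02-28T03:36:47,1,GET /data/m/star+wars+1977/6.mpg\n"
    "2025-02-28T03:36:48,2,GET /rate/tangled+2010=4\n"
    "2025-02-28T03:36:49,3,GET /data/m/tangled+2010/66.mpg\n"
)

out_path = os.path.join(tempfile.mkdtemp(), "data_entries.csv")
columns = ["timestamp", "request_id", "request"]

# Write the header once, then append every chunk without a header
pd.DataFrame(columns=columns).to_csv(out_path, index=False)

rows_written = 0
for chunk in pd.read_csv(sample_csv, header=None, names=columns, chunksize=2):
    # Keep only '/data/' requests from this chunk
    data_rows = chunk[chunk["request"].str.startswith("GET /data/")]
    data_rows.to_csv(out_path, mode="a", header=False, index=False)
    rows_written += len(data_rows)

result = pd.read_csv(out_path)
print(rows_written, len(result))
```

Writing the header separately and appending with `header=False` keeps the output a single well-formed CSV no matter how many chunks are processed.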


In [15]:
# Initialize both output files with a header row
data_headers = ["timestamp", "user_id", "movie_title", "watched_minutes"]
rate_headers = ["timestamp", "user_id", "movie_title", "rating"]

pd.DataFrame(columns=rate_headers).to_csv(OUTPUT_FILE_RATE, index=False)
pd.DataFrame(columns=data_headers).to_csv(OUTPUT_FILE_DATA, index=False)

# Process data in chunks and write to file
total_data_entries = 0
total_rate_entries = 0

for chunk in pd.read_csv(INPUT_FILE, header=None, 
                        names=["timestamp", "request_id", "request"],
                        chunksize=CHUNKSIZE):
    data_df, rate_df = process_chunk(chunk)
    
    # Append data entries to the file
    data_df.to_csv(OUTPUT_FILE_DATA, mode='a', header=False, index=False)
    total_data_entries += len(data_df)
    
    # Append rate entries to the file
    rate_df.to_csv(OUTPUT_FILE_RATE, mode='a', header=False, index=False)
    total_rate_entries += len(rate_df)
    
    # Optional: Print progress
    print(f"Processed {total_data_entries} data entries and {total_rate_entries} rate entries so far...")

# Print final results
print(f"Total processed: {total_data_entries} data entries and {total_rate_entries} rate entries")
print("All entries saved to files")

# Preview the results (reading just the first few lines of the output file)
print("\nPreview of Data file entries:")
print(pd.read_csv(OUTPUT_FILE_DATA, nrows=5))
print("\nPreview of Rate file entries:")
print(pd.read_csv(OUTPUT_FILE_RATE, nrows=5))


Processed 993030 data entries and 6970 rate entries so far...
Processed 1985946 data entries and 14054 rate entries so far...
Processed 2978706 data entries and 21294 rate entries so far...
Processed 3971502 data entries and 28498 rate entries so far...
Processed 4964504 data entries and 35496 rate entries so far...
Processed 5957825 data entries and 42175 rate entries so far...
Processed 6951489 data entries and 48511 rate entries so far...
Processed 7945459 data entries and 54541 rate entries so far...
Processed 8939433 data entries and 60567 rate entries so far...
Processed 9933687 data entries and 66313 rate entries so far...
Processed 10927694 data entries and 72306 rate entries so far...
Processed 11921707 data entries and 78293 rate entries so far...
Processed 12915694 data entries and 84306 rate entries so far...
Processed 13909537 data entries and 90463 rate entries so far...
Processed 14903290 data entries and 96710 rate entries so far...
Processed 15897162 data entries and 1

In [22]:

# Load the first 100 rows of the processed '/data/' output for inspection
processed_file = '../data/processed_data_entries.csv'
data_preview = pd.read_csv(processed_file, nrows=100)
data_preview

Unnamed: 0,timestamp,user_id,movie_title,watched_minutes
0,2025-02-28T03:36:47,122156,star wars 1977,6
1,2025-02-28T03:36:47,242890,the five senses 1999,100
2,2025-02-28T03:36:47,235684,clara and me 2004,64
3,2025-02-28T03:36:47,159468,the shawshank redemption 1994,64
4,2025-02-28T03:36:47,103729,tangled 2010,66
...,...,...,...,...
95,2025-02-28T03:36:47,149485,sos coast guard 1937,195
96,2025-02-28T03:36:47,116756,the tunnel 2001,117
97,2025-02-28T03:36:47,140141,in the mood for love 2000,1
98,2025-02-28T03:36:47,291627,sky captain and the world of tomorrow 2004,69
