# Exploratory Data Analysis (EDA)

# EDA Plan

## 1. Preliminary Steps

### Load Data

In [24]:
# Import necessary libraries
import pandas as pd

# File path configuration
file_path = '../data/raw/rawdata.tsv' 

# Load the dataset into a Pandas DataFrame, skipping the first line
df = pd.read_csv(file_path, sep='\t', skiprows=1)  # Skipping the first row due to format issues
print("\nDataset loaded successfully!")

# Step 1: Rename columns to match the export format documented by the extension
column_names = [
    "URL",              # The visited URL
    "Host",             # The hostname of the URL
    "Domain",           # The domain of the URL
    "Visit Time (ms)",  # The visit time in milliseconds
    "Visit Time",       # The visit time as a string in local time
    "Day of Week",      # The day of the week for the visit time
    "Transition Type",  # How the browser navigated to the URL
    "Page Title"        # The title of the visited URL
]

# Assign new column names to the DataFrame
df.columns = column_names
print("\nColumns named successfully!")
print(f"Updated columns: {df.columns.tolist()}")



Dataset loaded successfully!

Columns named successfully!
Updated columns: ['URL', 'Host', 'Domain', 'Visit Time (ms)', 'Visit Time', 'Day of Week', 'Transition Type', 'Page Title']
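
One detail of the load step is worth double-checking. With `skiprows=1` and the default `header=0`, pandas treats the first remaining line as a header row, so a header-less export would lose its first record to the column names. The sketch below illustrates the difference. It assumes (the source does not confirm this) a file with one non-data line followed by header-less records; the sample text and the shortened column list are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical export: one junk line, then two header-less records.
sample = "junk line\nhttps://a.example/\ta.example\nhttps://b.example/\tb.example\n"
names = ["URL", "Host"]  # shortened column list, for illustration only

# Default header=0: the first remaining line becomes the header, leaving one record.
implicit = pd.read_csv(io.StringIO(sample), sep="\t", skiprows=1)

# header=None with names=...: both records are kept as data.
explicit = pd.read_csv(
    io.StringIO(sample), sep="\t", skiprows=1, header=None, names=names
)

print(len(implicit), len(explicit))
```

If the real export does carry a header on its second line, the original call is already correct and only the renaming step is needed.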


### Inspect Structure

In [25]:
# Display the first few rows of the dataset
print("\nFirst few rows of the dataset:")
print(df.head())

# Check data types and column information
print("\nDataset Info:")
print(df.info())

# Summary statistics for numerical columns
print("\nSummary statistics for numerical columns:")
print(df.describe())

# Investigate "Day of Week" as a categorical variable
print("\nDay of Week Distribution (Categorical Variable):")
days_of_week = {
    0: "Sunday",
    1: "Monday",
    2: "Tuesday",
    3: "Wednesday",
    4: "Thursday",
    5: "Friday",
    6: "Saturday"
}
df['Day of Week Label'] = df['Day of Week'].map(days_of_week)  # Map integer values to day names
print(df['Day of Week Label'].value_counts())

# Summary statistics for other categorical columns
print("\nSummary statistics for other categorical columns:")
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    if col != 'Day of Week Label':  # Skip "Day of Week Label" since it was just analyzed
        print(f"\nColumn: {col}")
        print(df[col].value_counts())


First few rows of the dataset:
                                                 URL Host Domain  \
0  file:///Applications/CS310_Project_BackEnd%20(...  NaN    NaN   
1  file:///Applications/CS310_Project_BackEnd%20(...  NaN    NaN   
2  file:///Applications/CS310_Project_BackEnd%20(...  NaN    NaN   
3  file:///Applications/CS310_Project_BackEnd%20(...  NaN    NaN   
4     file:///Applications/Case1_Questions%20(1).pdf  NaN    NaN   

   Visit Time (ms)           Visit Time  Day of Week Transition Type  \
0     1.731682e+12  2024-11-15 17:42:17            5            link   
1     1.731689e+12  2024-11-15 19:50:56            5            link   
2     1.731689e+12  2024-11-15 19:51:10            5           typed   
3     1.731696e+12  2024-11-15 21:34:31            5            link   
4     1.729269e+12  2024-10-18 19:35:08            5            link   

                          Page Title  
0  CS310_Project_BackEnd (8) (1).pdf  
1  CS310_Project_BackEnd (8) (1).pdf  
2  CS310_

#### Dataset Overview

##### First Few Rows
Columns: URL, Host, Domain, Visit Time (ms), Visit Time, Day of Week, Transition Type, Page Title

##### Key Observations
- Host and Domain contain missing values (NaN), as the extension's export feature warns.
- URL holds the visited links, while Transition Type records how the browser navigated to them (e.g., link, typed).

##### Dataset Info
- Total Rows: 65,839 (index starts at 0).
- Total Columns: 8

##### Column Types
- float (numerical): 1 column (Visit Time (ms))
- int (categorical): 1 column (Day of Week; values are 0-6, where Sunday=0, Monday=1, etc.)
- object: 6 columns
    - URL: Identifier variable (the visited URL; not unique per row, since the same page can be visited many times)
    - Host: Categorical variable (represents the hostname of the URL)
    - Domain: Categorical variable (represents the domain of the URL, grouped based on the public suffix)
    - Visit Time (string): Date/Time variable (provides the exact timestamp of the visit)
    - Transition Type: Categorical variable (indicates how the browser navigated to the URL, e.g., link, typed, reload)
    - Page Title: Categorical variable (represents the title of the visited page)
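
The variable roles listed above can be reflected in the DataFrame's dtypes. The following is a minimal sketch, not part of the original notebook: it uses a small stand-in frame (`demo`, with made-up values) that borrows the notebook's column names, parses the time string, and stores the low-cardinality columns as categoricals.

```python
import pandas as pd

# Tiny stand-in frame using the notebook's column names; values are illustrative.
demo = pd.DataFrame({
    "Visit Time": ["2024-11-15 17:42:17", "2024-11-15 19:50:56"],
    "Day of Week": [5, 5],
    "Transition Type": ["link", "typed"],
})

# Parse the local-time string into a proper datetime column.
demo["Visit Time"] = pd.to_datetime(demo["Visit Time"], format="%Y-%m-%d %H:%M:%S")

# Store low-cardinality columns as categoricals.
demo["Day of Week"] = demo["Day of Week"].astype("category")
demo["Transition Type"] = demo["Transition Type"].astype("category")

print(demo.dtypes)
```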

##### Non-Null Counts
- Fully populated columns: URL, Visit Time (ms), Visit Time, Day of Week, Transition Type
- Columns with missing data: Host, Domain, Page Title (as indicated by the export feature).

##### Summary Statistics for Numerical Column

- **Visit Time (minutes)** (the `Visit Time (ms)` timestamps, converted from milliseconds to minutes):
  - Mean: 28779655.0 minutes
  - Standard Deviation: 78057.12 minutes
  - Minimum: 28623370.0 minutes
  - 25th Percentile: 28710733.0 minutes
  - 50th Percentile: 28812807.5 minutes
  - 75th Percentile: 28845016.7 minutes
  - Maximum: 28882166.7 minutes
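
These figures are timestamps rather than durations, so they are easier to read as dates. The sketch below converts the reported mean back to a calendar timestamp. It assumes the values count from the Unix epoch, which is consistent with the 2024 dates in the `Visit Time` column; the variable names are illustrative.

```python
import pandas as pd

mean_minutes = 28779655.0  # mean reported above, assumed to be minutes since the Unix epoch

# Convert minutes back to milliseconds, then to a UTC timestamp.
mean_ms = mean_minutes * 60 * 1000
mean_timestamp = pd.to_datetime(mean_ms, unit="ms", utc=True)

print(mean_timestamp)
```

The same `pd.to_datetime(..., unit="ms")` call also works on a whole column of millisecond values.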

##### Summary of Object (Categorical/Identifier/Date) Variables

###### Day of Week
- **Distribution of days:**
  - **Tuesday:** 11,033
  - **Wednesday:** 10,990
  - **Monday:** 9,766
  - **Thursday:** 9,292
  - **Sunday:** 8,901
  - **Saturday:** 8,246
  - **Friday:** 7,611

###### URL
- **Total unique URLs:** 16,812
- **Most common URL:** https://mail.google.com/mail/u/2/#inbox (1,073 occurrences)

###### Host
- **Total unique hosts:** 1,349
- **Most common host:** sucourse.sabanciuniv.edu (18,029 occurrences)

###### Domain
- **Total unique domains:** 1,047
- **Most common domain:** sabanciuniv.edu (25,630 occurrences)

###### Visit Time
- **Total unique timestamps:** 56,280
- **Most frequent timestamp:** 2024-11-04 03:35:00 (43 occurrences)

###### Transition Type
- **Types of transitions:**
  - **link:** 49,865
  - **generated:** 4,431
  - **reload:** 3,540
  - **form_submit:** 3,397
  - **auto_bookmark:** 2,995
  - **typed:** 1,320
  - **auto_toplevel:** 286
  - **manual_subframe:** 5

###### Page Title
- **Total unique titles:** 10,041
- **Most common title:** mySU (1,647 occurrences)
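
The unique counts and most common values above can be gathered into a single table instead of being read off separate `value_counts()` printouts. The sketch below uses a small made-up frame (`demo`) with two of the notebook's column names; the `summary` layout is an illustrative choice.

```python
import pandas as pd

# Small made-up frame with two of the notebook's column names.
demo = pd.DataFrame({
    "Host": ["a.example", "a.example", "b.example", None],
    "Page Title": ["Home", "Home", "Inbox", "Home"],
})

rows = []
for col in demo.columns:
    counts = demo[col].value_counts()  # missing values are excluded by default
    rows.append({
        "column": col,
        "unique": demo[col].nunique(),
        "top": counts.idxmax(),
        "freq": int(counts.max()),
    })

summary = pd.DataFrame(rows).set_index("column")
print(summary)
```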

### Missing Data

In [26]:
# Identify missing values in each column
missing_values = df.isnull().sum()
print("Missing Values Count:")
print(missing_values)

Missing Values Count:
URL                     0
Host                  832
Domain               1346
Visit Time (ms)         0
Visit Time              0
Day of Week             0
Transition Type         0
Page Title            396
Day of Week Label       0
dtype: int64


#### Results

Missing Values Count:
- **URL:** 0
- **Host:** 832
- **Domain:** 1346
- **Visit Time (ms):** 0
- **Visit Time:** 0
- **Day of Week:** 0
- **Transition Type:** 0
- **Page Title:** 396
- **Day of Week Label:** 0
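
Raw counts are easier to judge as a share of the dataset. A minimal sketch, using the counts above and the 65,839 rows reported in the dataset overview:

```python
import pandas as pd

total_rows = 65839  # row count reported in the dataset overview
missing_counts = pd.Series({"Host": 832, "Domain": 1346, "Page Title": 396})

# Share of rows with a missing value, in percent.
missing_pct = (missing_counts / total_rows * 100).round(2)

print(missing_pct)
# On the full DataFrame the same figures come from: df.isnull().mean() * 100
```

Each affected column is missing in roughly 2% of rows or fewer.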

In [27]:
# Replace missing values with "Unknown"
df.fillna("Unknown", inplace=True)
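
Only text columns have gaps in this dataset, so a blanket fill is harmless here. If a numeric column ever contained missing values, though, filling it with a string would change its dtype. The following sketch shows a narrower fill limited to chosen text columns; the frame `demo` and the missing numeric value are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "Host": ["a.example", None],
    "Visit Time (ms)": [1.731682e12, None],  # missing numeric value, made up for illustration
})

# Fill only the text columns; numeric gaps stay NaN and the dtype stays float.
text_columns = ["Host"]
demo[text_columns] = demo[text_columns].fillna("Unknown")

print(demo.dtypes)
```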

### Save Data Frame

In [28]:
# Define the output file path
output_file_path = '../data/processed/processedata.csv'

# Save the DataFrame to a CSV file
df.to_csv(output_file_path, index=False)  # index=False prevents adding the index column to the file

## 2. Data Quality Checks & Manipulation

### Initial Manipulation

In [30]:
file_path = '../data/processed/processedata.csv'
df = pd.read_csv(file_path)

# Step 1: Remove the "Transition Type" column
df = df.drop(columns=['Transition Type'])

# Step 2: Convert "Visit Time (ms)" to minutes
df['Visit Time (minutes)'] = df['Visit Time (ms)'] / (1000 * 60)  # Convert milliseconds to minutes
df = df.drop(columns=['Visit Time (ms)'])  # Drop the original column

# Save the updated DataFrame back to the processed data file
df.to_csv(file_path, index=False)

#### Operations Performed:
1. **Column Removed**: 
   - The column `Transition Type` was removed because it is irrelevant to the analysis.

2. **Conversion of Visit Time**:
   - The column `Visit Time (ms)` was converted from milliseconds to minutes.
   - A new column named `Visit Time (minutes)` was added.
   - The original `Visit Time (ms)` column was removed to maintain clarity.

#### Updated Data:
- The changes were saved back to `../data/processed/processedata.csv`.
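
Because the cell reads and overwrites the same file, running it a second time raises a `KeyError`: the columns it tries to drop are already gone. The sketch below shows one way to make the step safe to re-run. `apply_initial_manipulation` is a hypothetical helper name, not part of the original notebook, and `demo` is a one-row stand-in frame.

```python
import pandas as pd

def apply_initial_manipulation(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop `Transition Type` and convert ms to minutes."""
    # errors="ignore" skips columns that are already gone.
    frame = frame.drop(columns=["Transition Type"], errors="ignore")
    # Convert only if the millisecond column is still present.
    if "Visit Time (ms)" in frame.columns:
        frame["Visit Time (minutes)"] = frame["Visit Time (ms)"] / (1000 * 60)
        frame = frame.drop(columns=["Visit Time (ms)"])
    return frame

demo = pd.DataFrame({"Visit Time (ms)": [120000.0], "Transition Type": ["link"]})
once = apply_initial_manipulation(demo)
twice = apply_initial_manipulation(once)  # second pass leaves the frame unchanged

print(twice)
```

Writing the result to a separate output file, rather than overwriting the input, would avoid the problem as well.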