# Exploratory Data Analysis (EDA)

## 1. Preliminary Steps

### Load Data

In [None]:
# Import necessary libraries
import pandas as pd

# File path configuration
file_path = '../data/raw/rawdata.tsv' 

# Load the dataset into a Pandas DataFrame, skipping the first line
df = pd.read_csv(file_path, sep='\t', skiprows=1)  # Skipping the first row due to format issues
print("\nDataset loaded successfully!")

# Step 1: Rename the columns to match the export format documented for the extension's export feature
column_names = [
    "URL",              # The visited URL
    "Host",             # The hostname of the URL
    "Domain",           # The domain of the URL
    "Visit Time (ms)",  # The visit time in milliseconds
    "Visit Time",       # The visit time as a string in local time
    "Day of Week",      # The day of the week for the visit time
    "Transition Type",  # How the browser navigated to the URL
    "Page Title"        # The title of the visited URL
]

# Assign new column names to the DataFrame
df.columns = column_names
print("\nColumns named successfully!")
print(f"Updated columns: {df.columns.tolist()}")
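
One thing worth checking in the load step: with `skiprows=1` and the default header handling, pandas treats the first remaining line as the header row. If that line is actually a data row, it is consumed as column labels and then overwritten by the rename. The source does not say which case applies, so the following is only a sketch on a made-up two-column TSV (the `raw` string and the shortened column list are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical TSV: one malformed first line, then data rows with no header row
raw = "broken first line\nhttps://a.example\ta.example\nhttps://b.example\tb.example\n"

# Notebook-style call: after skipping line 1, the next line becomes the header
as_header = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=1)

# Alternative: state that no header row exists and supply the names directly
named = pd.read_csv(
    io.StringIO(raw), sep="\t", skiprows=1, header=None, names=["URL", "Host"]
)

# Compare how many data rows each variant keeps
print(len(as_header), len(named))
```

If the second line of the real file is a genuine header row, the notebook's call is already correct and nothing needs to change.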


### Inspect Structure

In [None]:
# Display the first few rows of the dataset
print("\nFirst few rows of the dataset:")
print(df.head())

# Check data types and column information
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None, so print() is not needed

# Investigate "Day of Week" as a categorical variable
print("\nDay of Week Distribution (Categorical Variable):")
days_of_week = {
    0: "Sunday",
    1: "Monday",
    2: "Tuesday",
    3: "Wednesday",
    4: "Thursday",
    5: "Friday",
    6: "Saturday"
}
# Analyze the distribution using the mapping without modifying the dataset
day_of_week_distribution = df['Day of Week'].map(days_of_week).value_counts()
print(day_of_week_distribution)

# Summary statistics for other categorical columns
print("\nSummary statistics for other categorical columns:")
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts())


#### Dataset Overview

##### First Few Rows
Columns: URL, Host, Domain, Visit Time (ms), Visit Time, Day of Week, Transition Type, Page Title

##### Key Observations
- Host, Domain, and Page Title contain missing values (NaN), as the extension's export feature warns.
- URL provides the visited links, while Transition Type shows navigation type (link, typed).

##### Dataset Info
- Total Rows: 65,839 (the index starts at 0)
- Total Columns: 8

##### Column Types
- float (date/time data): 1 column (Visit Time (ms))
- int (categorical variable): 1 column (Day of Week, the day of the week for the visit time; values are 0-6 with Sunday=0, Monday=1, etc.)
- object: 6 columns
    - URL: Identifier variable (represents the visited URL; values repeat when the same page is visited more than once)
    - Host: Categorical variable (represents the hostname of the URL)
    - Domain: Categorical variable (represents the domain of the URL, grouped based on the public suffix)
    - Visit Time (string): Date/Time variable (provides the exact timestamp of the visit)
    - Transition Type: Categorical variable (indicates how the browser navigated to the URL, e.g., link, typed, reload)
    - Page Title: Categorical variable (represents the title of the visited page)
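
Since `Visit Time (ms)` is stored as a float, it cannot be used as a date without conversion. The sketch below shows one possible conversion. It assumes the values are milliseconds since the Unix epoch, which the export format description here does not confirm, and it uses made-up sample values. The column names `Visit Timestamp` and `Weekday (Mon=0)` are hypothetical.

```python
import pandas as pd

# Hypothetical sample values, assumed to be milliseconds since the Unix epoch (UTC)
sample = pd.DataFrame({"Visit Time (ms)": [1730680500000.0, 1730684100000.0]})

# Convert to timezone-aware timestamps in a new column, leaving the original intact
sample["Visit Timestamp"] = pd.to_datetime(
    sample["Visit Time (ms)"], unit="ms", utc=True
)

# pandas numbers weekdays with Monday=0, unlike the export's Sunday=0 convention
sample["Weekday (Mon=0)"] = sample["Visit Timestamp"].dt.dayofweek

print(sample)
```

The differing weekday conventions matter if the derived weekday is ever compared against the exported `Day of Week` column.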

##### Non-Null Counts
- Fully populated columns: URL, Visit Time (ms), Visit Time (String), Day of Week, Transition Type
- Columns with missing data: Host, Domain, Page Title (as the export feature warns)

##### Summary of Object (Categorical/Identifier/Date) Variables

###### Day of Week
- **Distribution of days:**
  - **Tuesday:** 11,033
  - **Wednesday:** 10,990
  - **Monday:** 9,766
  - **Thursday:** 9,292
  - **Sunday:** 8,901
  - **Saturday:** 8,246
  - **Friday:** 7,611
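
`value_counts()` sorts by frequency, which is why the days above appear out of calendar order. A minimal sketch of how the same counts could be listed Sunday through Saturday, using a small made-up series in place of `df['Day of Week']`:

```python
import pandas as pd

# Same mapping as in the notebook (Sunday=0 ... Saturday=6)
days_of_week = {
    0: "Sunday", 1: "Monday", 2: "Tuesday", 3: "Wednesday",
    4: "Thursday", 5: "Friday", 6: "Saturday",
}

# Toy stand-in for df['Day of Week']
codes = pd.Series([2, 2, 3, 0, 6, 2, 3])

# Count by label, then reindex into calendar order; days with no visits become 0
counts = (
    codes.map(days_of_week)
    .value_counts()
    .reindex(list(days_of_week.values()), fill_value=0)
)
print(counts)
```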

###### URL
- **Total unique URLs:** 16,812
- **Most common URL:** https://mail.google.com/mail/u/2/#inbox (1,073 occurrences)

###### Host
- **Total unique hosts:** 1,349
- **Most common host:** sucourse.sabanciuniv.edu (18,029 occurrences)

###### Domain
- **Total unique domains:** 1,047
- **Most common domain:** sabanciuniv.edu (25,630 occurrences)

###### Visit Time
- **Total unique timestamps:** 56,280
- **Most frequent timestamp:** 2024-11-04 03:35:00 (43 occurrences)

###### Transition Type
- **Types of transitions:**
  - **link:** 49,865
  - **generated:** 4,431
  - **reload:** 3,540
  - **form_submit:** 3,397
  - **auto_bookmark:** 2,995
  - **typed:** 1,320
  - **auto_toplevel:** 286
  - **manual_subframe:** 5
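
Raw counts are easier to compare as shares of all visits. As a sketch, the counts listed above can be turned into percentages like this (the series is typed in by hand from the list; on the real DataFrame, `df['Transition Type'].value_counts(normalize=True)` would give the same proportions):

```python
import pandas as pd

# Transition counts copied from the list above
transition_counts = pd.Series({
    "link": 49865,
    "generated": 4431,
    "reload": 3540,
    "form_submit": 3397,
    "auto_bookmark": 2995,
    "typed": 1320,
    "auto_toplevel": 286,
    "manual_subframe": 5,
})

# Share of each transition type among all visits
shares = transition_counts / transition_counts.sum()
print((shares * 100).round(1))
```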

###### Page Title
- **Total unique titles:** 10,041
- **Most common title:** mySU (1,647 occurrences)

### Missing Data

In [None]:
# Identify missing values in each column
missing_values = df.isnull().sum()
print("Missing Values Count:")
print(missing_values)

#### Results

Missing Values Count:
- **URL:** 0
- **Host:** 832
- **Domain:** 1346
- **Visit Time (ms):** 0
- **Visit Time:** 0
- **Day of Week:** 0
- **Transition Type:** 0
- **Page Title:** 396
- **Day of Week Label:** 0
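
To judge how much the gaps matter, the counts can be expressed as a percentage of the 65,839 rows. A small sketch using the reported numbers (on the real DataFrame, `df.isnull().mean() * 100` would produce the same figures directly):

```python
import pandas as pd

total_rows = 65839  # total rows reported in the dataset info

# Missing-value counts copied from the results above (non-zero columns only)
missing_counts = pd.Series({"Host": 832, "Domain": 1346, "Page Title": 396})

# Percentage of rows with a missing value in each column
missing_pct = missing_counts / total_rows * 100
print(missing_pct.round(2))
```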

In [118]:
# Replace missing values with "Unknown"
df.fillna("Unknown", inplace=True)

#### Handling Missing Data

All missing values were replaced with "Unknown" so that later grouping and counting steps do not silently drop those rows.
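
A blanket `fillna` works here because only text columns contain gaps. If a numeric column ever had missing values, the same call would put strings into it. A more defensive variant, sketched on a toy frame (the frame and the `text_cols` list are illustrative assumptions), limits the fill to the text columns:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps only in text columns
toy = pd.DataFrame({
    "Host": ["example.com", np.nan],
    "Page Title": [np.nan, "Home"],
    "Day of Week": [0, 3],
})

# In the notebook this list would be Host, Domain, and Page Title
text_cols = ["Host", "Page Title"]

# Fill only the listed columns; numeric columns keep their dtype
toy[text_cols] = toy[text_cols].fillna("Unknown")
print(toy)
```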

### Save Data Frame

In [119]:
# Define the output file path
output_file_path = '../data/processed/processedata.csv'

# Save the DataFrame to a CSV file
df.to_csv(output_file_path, index=False)  # index=False prevents adding the index column to the file

## 2. Data Checks & Manipulation

### Initial Manipulation

In [120]:
file_path = '../data/processed/processedata.csv'
df = pd.read_csv(file_path)

# Remove the "Visit Time (ms)" and "Transition Type" columns in a single call
df = df.drop(columns=['Visit Time (ms)', 'Transition Type'])

# Save the updated DataFrame back to the processed data file
df.to_csv(file_path, index=False)

#### Operations Performed

##### Columns Removed
- The `Transition Type` and `Visit Time (ms)` columns were removed because they are not relevant to the analysis.

##### Updated Data
- The changes were saved back to the processed data file, `../data/processed/processedata.csv`.
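
Because the cell overwrites the file it reads from, running it a second time would raise a `KeyError`: the columns are already gone. One way to make the step safe to re-run, sketched on a toy frame with made-up values, is to pass `errors="ignore"` to `drop`:

```python
import pandas as pd

# Toy frame standing in for the processed data
toy = pd.DataFrame({
    "URL": ["https://a.example", "https://b.example"],
    "Visit Time (ms)": [1.0, 2.0],
    "Transition Type": ["link", "typed"],
})

to_drop = ["Visit Time (ms)", "Transition Type"]

# First run removes the columns; a repeated run is a no-op instead of an error
once = toy.drop(columns=to_drop, errors="ignore")
twice = once.drop(columns=to_drop, errors="ignore")
print(twice.columns.tolist())
```

The trade-off is that a misspelled column name is also ignored silently, so the resulting column list is worth printing after the drop.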

### Check for Duplicates

In [None]:
# File path to the processed data
file_path = '../data/processed/processedata.csv'

# Load the processed dataset
df = pd.read_csv(file_path)

# Check for duplicates and count them
duplicate_count = df.duplicated().sum()

# Output the duplicate count
print(f"Number of duplicate rows: {duplicate_count}")

#### Results
No duplicate rows were found in the dataset: the duplicate count is 0, so no two rows are identical across all remaining columns and nothing needed to be removed.
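
`df.duplicated()` only flags rows that match on every column, and it marks the second and later copies, not the first. A short sketch on a toy frame with made-up values shows how duplicates could be inspected more closely, including a check restricted to a subset of columns (the choice of `URL` and `Visit Time` as the subset is an assumption for illustration):

```python
import pandas as pd

# Toy frame in which the first two rows are identical
toy = pd.DataFrame({
    "URL": ["a", "a", "b"],
    "Visit Time": ["t1", "t1", "t2"],
    "Page Title": ["A", "A", "B"],
})

# Default behaviour: only the repeated copy is counted
full_dupes = toy.duplicated().sum()

# keep=False returns every member of a duplicate group for inspection
all_copies = toy[toy.duplicated(keep=False)]

# Restricting to key columns finds visits that repeat on URL and time alone
same_visit = toy.duplicated(subset=["URL", "Visit Time"]).sum()

print(full_dupes, len(all_copies), same_visit)
```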


## 3. Exploratory Analysis

### Attribute Analysis