# Exploratory Data Analysis (EDA)

## 1. Preliminary Steps

### Load Data

In [None]:
# Import necessary libraries
import pandas as pd

# File path configuration
file_path = '../data/raw/rawdata.tsv' 

# Load the dataset into a Pandas DataFrame, skipping the first line
df = pd.read_csv(file_path, sep='\t', skiprows=1)  # Skipping the first row due to format issues
print("\nDataset loaded successfully!")

# Step 1: Rename the columns to match the export format documented by the extension
column_names = [
    "URL",              # The visited URL
    "Host",             # The hostname of the URL
    "Domain",           # The domain of the URL
    "Visit Time (ms)",  # The visit time in milliseconds
    "Visit Time",       # The visit time as a string in local time
    "Day of Week",      # The day of the week for the visit time
    "Transition Type",  # How the browser navigated to the URL
    "Page Title"        # The title of the visited URL
]

# Assign new column names to the DataFrame
df.columns = column_names
print("\nColumns named successfully!")
print(f"Updated columns: {df.columns.tolist()}")


### Inspect Structure

In [None]:
# Display the first few rows of the dataset
print("\nFirst few rows of the dataset:")
print(df.head())

# Check data types and column information
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None, so no print() is needed

# Investigate "Day of Week" as a categorical variable
print("\nDay of Week Distribution (Categorical Variable):")
days_of_week = {
    0: "Sunday",
    1: "Monday",
    2: "Tuesday",
    3: "Wednesday",
    4: "Thursday",
    5: "Friday",
    6: "Saturday"
}
# Analyze the distribution using the mapping without modifying the dataset
day_of_week_distribution = df['Day of Week'].map(days_of_week).value_counts()
print(day_of_week_distribution)

# Summary statistics for other categorical columns
print("\nSummary statistics for other categorical columns:")
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts())


#### Dataset Overview

##### First Few Rows
Columns: URL, Host, Domain, Visit Time (ms), Visit Time, Day of Week, Transition Type, Page Title

##### Key Observations
- Host, Domain, and Page Title contain missing values (NaN), as the extension's export feature warns.
- URL holds the visited links, while Transition Type shows how the browser navigated to them (e.g., link, typed).

##### Dataset Info
- Total Rows: 65,839 (index starts at 0)
- Total Columns: 8

##### Column Types
- float (date/time): 1 column (Visit Time (ms))
- int (categorical): 1 column (Day of Week; values are 0-6, with Sunday=0, Monday=1, etc.)
- object: 6 columns
    - URL: Identifier variable (the visited URL; values repeat across rows, since the same page can be visited more than once)
    - Host: Categorical variable (represents the hostname of the URL)
    - Domain: Categorical variable (represents the domain of the URL, grouped based on the public suffix)
    - Visit Time (string): Date/Time variable (provides the exact timestamp of the visit)
    - Transition Type: Categorical variable (indicates how the browser navigated to the URL, e.g., link, typed, reload)
    - Page Title: Categorical variable (represents the title of the visited page)
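
The float `Visit Time (ms)` column can be turned into a proper datetime when needed. The sketch below assumes the values are milliseconds since the Unix epoch (an assumption based on the column name, not confirmed by the export), and the sample value is made up for illustration.

```python
import pandas as pd

# Hypothetical sample; values are ASSUMED to be milliseconds since the Unix epoch
sample = pd.DataFrame({"Visit Time (ms)": [1730687700000.0]})

# Convert the float milliseconds to timezone-aware UTC timestamps
sample["Visit Time (UTC)"] = pd.to_datetime(sample["Visit Time (ms)"], unit="ms", utc=True)

# pandas numbers days Monday=0..Sunday=6; shift to the export's Sunday=0 convention
sample["Day of Week"] = (sample["Visit Time (UTC)"].dt.dayofweek + 1) % 7

print(sample)
```

Because the `Visit Time` string is in local time, the converted UTC value may differ from it by the local offset.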

##### Non-Null Counts
- Fully populated columns: URL, Visit Time (ms), Visit Time, Day of Week, Transition Type
- Columns with missing data: Host, Domain, Page Title (as indicated by the export feature)

##### Summary of Object (Categorical/Identifier/Date) Variables

###### Day of Week
- **Distribution of days:**
  - **Tuesday:** 11,033
  - **Wednesday:** 10,990
  - **Monday:** 9,766
  - **Thursday:** 9,292
  - **Sunday:** 8,901
  - **Saturday:** 8,246
  - **Friday:** 7,611
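
The counts above are sorted by frequency. As a small sketch, the same figures can be put in calendar order and split into weekday and weekend totals. The numbers are copied from the list above, and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the Day of Week distribution listed above
counts = pd.Series({
    "Tuesday": 11033, "Wednesday": 10990, "Monday": 9766, "Thursday": 9292,
    "Sunday": 8901, "Saturday": 8246, "Friday": 7611,
})

# Reorder to follow the export's convention (Sunday=0 ... Saturday=6)
calendar_order = ["Sunday", "Monday", "Tuesday", "Wednesday",
                  "Thursday", "Friday", "Saturday"]
ordered = counts.reindex(calendar_order)

# Split into weekend and weekday totals
weekend_total = ordered[["Saturday", "Sunday"]].sum()
weekday_total = ordered.sum() - weekend_total

print(ordered)
print(f"Weekday visits: {weekday_total}, weekend visits: {weekend_total}")
```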

###### URL
- **Total unique URLs:** 16,812
- **Most common URL:** https://mail.google.com/mail/u/2/#inbox (1,073 occurrences)

###### Host
- **Total unique hosts:** 1,349
- **Most common host:** sucourse.sabanciuniv.edu (18,029 occurrences)

###### Domain
- **Total unique domains:** 1,047
- **Most common domain:** sabanciuniv.edu (25,630 occurrences)

###### Visit Time
- **Total unique timestamps:** 56,280
- **Most frequent timestamp:** 2024-11-04 03:35:00 (43 occurrences)

###### Transition Type
- **Types of transitions:**
  - **link:** 49,865
  - **generated:** 4,431
  - **reload:** 3,540
  - **form_submit:** 3,397
  - **auto_bookmark:** 2,995
  - **typed:** 1,320
  - **auto_toplevel:** 286
  - **manual_subframe:** 5

###### Page Title
- **Total unique titles:** 10,041
- **Most common title:** mySU (1,647 occurrences)

### Missing Data

In [None]:
# Identify missing values in each column
missing_values = df.isnull().sum()
print("Missing Values Count:")
print(missing_values)

#### Results

Missing Values Count:
- **URL:** 0
- **Host:** 832
- **Domain:** 1346
- **Visit Time (ms):** 0
- **Visit Time:** 0
- **Day of Week:** 0
- **Transition Type:** 0
- **Page Title:** 396
- **Day of Week Label:** 0

In [136]:
# Replace missing values with "Unknown"
df.fillna("Unknown", inplace=True)

#### Handling Missing Data

All missing values were replaced with "Unknown" so that later frequency counts and groupings include these rows instead of silently dropping them.
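
A more targeted variant, shown here as a sketch on made-up rows, fills only the text columns that actually contain missing values. This avoids writing a string into a numeric column if one ever contains NaN. The column list is taken from the missing-value counts above, and the sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in the three text columns
sample = pd.DataFrame({
    "Host": ["mail.google.com", np.nan],
    "Domain": ["google.com", np.nan],
    "Page Title": [np.nan, "mySU"],
    "Day of Week": [1, 2],
})

# Fill only the text columns; numeric columns keep their dtype
text_columns = ["Host", "Domain", "Page Title"]
sample[text_columns] = sample[text_columns].fillna("Unknown")

print(sample)
```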

### Save Data Frame

In [137]:
# Define the output file path
output_file_path = '../data/processed/processedata.csv'

# Save the DataFrame to a CSV file
df.to_csv(output_file_path, index=False)  # index=False prevents adding the index column to the file

## 2. Data Checks & Manipulation

### Initial Manipulation

In [138]:
file_path = '../data/processed/processedata.csv'
df = pd.read_csv(file_path)

# Remove the "Visit Time (ms)" and "Transition Type" columns
df = df.drop(columns=['Visit Time (ms)', 'Transition Type'])

# Save the updated DataFrame back to the processed data file
df.to_csv(file_path, index=False)

#### Operations Performed:

#### Columns Removed:
- The `Transition Type` and `Visit Time (ms)` columns were removed because they are not relevant to the analysis.

#### Updated Data:
- The changes were saved back to the processed dataset (`../data/processed/processedata.csv`).

### Check for Duplicates

In [None]:
# File path to the processed data
file_path = '../data/processed/processedata.csv'

# Load the processed dataset
df = pd.read_csv(file_path)

# Check for duplicates and count them
duplicate_count = df.duplicated().sum()

# Output the duplicate count
print(f"Number of duplicate rows: {duplicate_count}")

#### Results
No duplicate rows were found in the dataset. This indicates that all rows are unique and no redundant data exists.
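
`df.duplicated()` only flags rows that match on every column. A complementary check, sketched here on made-up rows, looks for repeats on a subset of columns such as `URL` and `Visit Time`. Whether such repeats count as redundant for this dataset is a judgment call, not something the result above establishes.

```python
import pandas as pd

# Hypothetical rows: the first two share URL and timestamp but differ in title
sample = pd.DataFrame({
    "URL": ["https://a.example/", "https://a.example/", "https://b.example/"],
    "Visit Time": ["2024-11-04 03:35:00"] * 3,
    "Page Title": ["A", "A (updated)", "B"],
})

# Full-row check: every column must match
full_row_duplicates = sample.duplicated().sum()

# Subset check: only URL and Visit Time must match
subset_duplicates = sample.duplicated(subset=["URL", "Visit Time"]).sum()

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"URL + Visit Time duplicates: {subset_duplicates}")
```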


## 3. Exploratory Analysis

### Attribute Analysis

Must interpret output for report.

In [None]:
import matplotlib.pyplot as plt # Import the Matplotlib library. 

# Analyze categorical attributes
categorical_columns = ['Host', 'Domain', 'Page Title', 'Day of Week']

# Frequencies and bar charts for categorical attributes
for col in categorical_columns:
    print(f"\nFrequencies for {col}:")
    print(df[col].value_counts())
    
    # Create bar chart for visualization
    plt.figure(figsize=(10, 6))
    df[col].value_counts().head(10).plot(kind='bar', color='skyblue')
    plt.title(f"Top 10 Most Frequent Values in {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

# Specific Analysis for "Day of Week"
if 'Day of Week' in df.columns:
    days_of_week_labels = {
        0: "Sunday",
        1: "Monday",
        2: "Tuesday",
        3: "Wednesday",
        4: "Thursday",
        5: "Friday",
        6: "Saturday"
    }
    # Map integers to day names for better analysis
    df['Day of Week Label'] = df['Day of Week'].map(days_of_week_labels)

    print("\nDay of Week Frequency (Categorical):")
    print(df['Day of Week Label'].value_counts())

    # Bar chart for Day of Week distribution
    plt.figure(figsize=(8, 6))
    df['Day of Week Label'].value_counts().plot(kind='bar', color='lightcoral')
    plt.title("Distribution of Visits by Day of Week")
    plt.xlabel("Day of Week")
    plt.ylabel("Frequency")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

### Temporal Analysis of Visit Time

Must interpret output for report.

In [None]:
# Convert the 'Visit Time' string column to datetime format
df['Visit Time Parsed'] = pd.to_datetime(df['Visit Time'], format='%Y-%m-%d %H:%M:%S')

# Extract temporal components
df['Hour'] = df['Visit Time Parsed'].dt.hour  # Extract hour
df['Day'] = df['Visit Time Parsed'].dt.day  # Extract day
# Extract the day name (e.g., 'Monday') from the parsed timestamp
df['Day of Week (Name)'] = df['Visit Time Parsed'].dt.day_name()

# Analyze hourly browsing patterns
hourly_counts = df['Hour'].value_counts().sort_index()

# Plot hourly browsing patterns
plt.figure(figsize=(10, 6))
hourly_counts.plot(kind='bar', color='skyblue')
plt.title('Browsing Patterns by Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

# Analyze daily browsing patterns
daily_counts = df['Day of Week (Name)'].value_counts()

# Plot daily browsing patterns
plt.figure(figsize=(10, 6))
daily_counts.plot(kind='bar', color='lightcoral')
plt.title('Browsing Patterns by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

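# Combine Day of Week and Hour for more detailed analysis.
# Missing day/hour combinations are filled with 0, and rows are put in calendar
# order (Sunday first, matching the export) instead of alphabetical order.
day_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
hour_day_counts = (
    df.groupby(['Day of Week (Name)', 'Hour']).size()
    .unstack(fill_value=0)
    .reindex(day_order, fill_value=0)
)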

# Heatmap for Hour and Day of Week
plt.figure(figsize=(12, 8))
plt.imshow(hour_day_counts, aspect='auto', cmap='viridis', interpolation='nearest')
plt.title('Heatmap of Browsing Patterns (Day vs Hour)')
plt.xlabel('Hour of the Day')
plt.ylabel('Day of the Week')
plt.colorbar(label='Frequency')
plt.xticks(range(len(hour_day_counts.columns)), hour_day_counts.columns, rotation=45)
plt.yticks(range(len(hour_day_counts.index)), hour_day_counts.index)
plt.tight_layout()
plt.show()

### Academic vs Non-Academic Activity Analysis

Must interpret output for report.

In [None]:
# Updated list of academic domains
academic_domains = [
    'sabanciuniv.edu', 'edu', 'researchgate.net', 'arxiv.org', 'sciencedirect.com', 
    'springer.com', 'jstor.org', 'ieee.org', 'acm.org', 'nature.com', 'wiley.com',
    'udemy.com', 'w3schools.com', 'chegg.com', 'gradescope.com', 'geeksforgeeks.org',
    'pearson.com', 'ets.org', 'openai.com', 'notion.so', 'notion.site', 'claude.ai',
    'github.com', 'jsmastery.pro', 'ilovepdf.com', 'symplicity.com', 'zoom.us'
]

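# Categorize websites into Academic and Non-Academic.
# Match whole dot-separated labels only, so 'edu' matches 'sabanciuniv.edu'
# but not an unrelated domain that merely contains the letters "edu".
df['Activity Type'] = df['Domain'].apply(
    lambda x: 'Academic'
    if any(f'.{domain}.' in f'.{x}.' for domain in academic_domains)
    else 'Non-Academic'
)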

# Analyze frequencies of Academic vs Non-Academic
activity_counts = df['Activity Type'].value_counts()
print("Activity Type Frequencies:")
print(activity_counts)

# Plot activity type distribution
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
activity_counts.plot(kind='bar', color=['blue', 'orange'])
plt.title("Academic vs Non-Academic Activity")
plt.xlabel("Activity Type")
plt.ylabel("Frequency")
plt.show()
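
Matching a domain against a list of entries is sensitive to how the comparison is done. The sketch below, using made-up domains and two hypothetical helpers, contrasts a plain substring test with a test that only matches whole dot-separated labels.

```python
def matches_substring(domain: str, entries: list[str]) -> bool:
    """True if any entry appears anywhere inside the domain string."""
    return any(entry in domain for entry in entries)


def matches_label(domain: str, entries: list[str]) -> bool:
    """True only if an entry lines up with whole dot-separated labels."""
    padded = f".{domain}."
    return any(f".{entry}." in padded for entry in entries)


# Illustrative entries and domains (not taken from the dataset)
entries = ["edu", "github.com"]
for domain in ["sabanciuniv.edu", "gist.github.com", "education-news.com"]:
    print(domain, matches_substring(domain, entries), matches_label(domain, entries))
```

The two tests agree on the first two domains and differ on the third, where the letters "edu" appear inside a longer label.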

### Peak Browsing Periods

Must interpret output for report.

In [None]:
# Extract the hour of each visit from the 'Visit Time' column
df['Visit Hour'] = pd.to_datetime(df['Visit Time']).dt.hour

# Analyze hourly activity
hourly_activity = df['Visit Hour'].value_counts().sort_index()
print("Hourly Activity Distribution:")
print(hourly_activity)

# Plot hourly activity
plt.figure(figsize=(10, 6))
hourly_activity.plot(kind='bar', color='green')
plt.title("Hourly Browsing Activity")
plt.xlabel("Hour of the Day")
plt.ylabel("Frequency")
plt.xticks(rotation=0)
plt.show()

### Behavioral Patterns During Academic and Leisure Days

Must interpret output for report.

In [None]:
# Classify days as weekday or weekend
df['Day Type'] = df['Day of Week'].apply(lambda x: 'Weekday' if x in [1, 2, 3, 4, 5] else 'Weekend')

# Analyze frequencies of visits during weekdays and weekends
day_type_counts = df['Day Type'].value_counts()
print("Weekday vs Weekend Activity:")
print(day_type_counts)

# Plot day type activity distribution
plt.figure(figsize=(8, 6))
day_type_counts.plot(kind='bar', color='cyan')
plt.title("Weekday vs Weekend Browsing Activity")
plt.xlabel("Day Type")
plt.ylabel("Frequency")
plt.show()