# **Exploratory Data Analysis (EDA)**

**This notebook provides an exploratory data analysis of the developer role classification dataset. The goal is to understand the dataset's structure, identify potential issues, visualize feature distributions, and gain insights that can inform preprocessing and modeling decisions.**

# **Load Data**

In [None]:
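import ast  # Used later to parse the stringified 'fileextensions' lists

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

file_path = '/content/final_dataset.csv'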
try:
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully.")
    print(f"Dataset shape: {df.shape}")
except FileNotFoundError:
    print(f"Error: Dataset not found at {file_path}")
    df = None # Set df to None if loading fails

In [None]:
if df is not None:
    print("\nFirst 5 rows of the dataset:")
    display(df.head())

    print("\nColumn information:")
    df.info()

    print("\nMissing values per column:")
    print(df.isnull().sum())
else:
    print("Data not loaded, skipping overview.")

# **Target Variable Distribution**

**Let's examine the distribution of the target variable, 'role'. This is important to understand if there is any class imbalance.**

In [None]:
if df is not None:
    plt.figure(figsize=(8, 6))
    sns.countplot(x='role', data=df, palette='viridis')
    plt.title('Distribution of Developer Roles')
    plt.xlabel('Developer Role')
    plt.ylabel('Count')
    plt.show()

    print("\nValue counts for 'role':")
    print(df['role'].value_counts())
    print("\nPercentage distribution of 'role':")
    print(df['role'].value_counts(normalize=True).round(3))
else:
    print("Data not loaded, skipping target distribution analysis.")

# **Explore Numerical Features**

**Let's analyze the distributions of the numerical features numfileschanged, linesadded, linesdeleted, and numcommentsadded. Histograms with a density overlay show their shape and any skew.**

In [None]:
if df is not None:
    numerical_features = ['numfileschanged', 'linesadded', 'linesdeleted', 'numcommentsadded']

    plt.figure(figsize=(15, 10))
    for i, feature in enumerate(numerical_features):
        plt.subplot(2, 2, i + 1)
        sns.histplot(df[feature], kde=True, bins=30, color='skyblue')
        plt.title(f'Distribution of {feature}')
        plt.xlabel(feature)
        plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()
else:
    print("Data not loaded, skipping numerical feature analysis.")
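Histograms show skew visually; a skewness statistic puts a number on it. The following is a minimal sketch, not part of the original analysis: the helper name `summarize_skew` and the synthetic `demo` frame are assumptions made for illustration. It compares each feature's skewness before and after a `log1p` transform, a common way to inspect heavy-tailed count data.

```python
import numpy as np
import pandas as pd


def summarize_skew(frame, columns):
    """Report sample skewness of each column, raw and after log1p."""
    rows = []
    for col in columns:
        values = frame[col].dropna()
        rows.append({
            "feature": col,
            "skew_raw": values.skew(),
            # log1p handles zero counts, which a plain log would not
            "skew_log1p": np.log1p(values).skew(),
        })
    return pd.DataFrame(rows).set_index("feature")


# Synthetic, hypothetical stand-in for a heavy-tailed column such as linesadded
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"linesadded": rng.lognormal(mean=3.0, sigma=1.2, size=1000).round()}
)

skew_report = summarize_skew(demo, ["linesadded"])
print(skew_report)
```

On the real dataset, the call would presumably be `summarize_skew(df, numerical_features)`. A large drop in skewness after the transform suggests that plotting or modeling on the log scale may be more informative.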

# **Explore Categorical Features**

**Let's look at the distribution of the categorical feature committype.**

In [None]:
if df is not None:
    plt.figure(figsize=(8, 6))
    sns.countplot(x='committype', data=df, palette='viridis')
    plt.title('Distribution of Commit Types')
    plt.xlabel('Commit Type')
    plt.ylabel('Count')
    plt.show()

    print("\nValue counts for 'committype':")
    print(df['committype'].value_counts())
    print("\nPercentage distribution of 'committype':")
    print(df['committype'].value_counts(normalize=True).round(3))
else:
    print("Data not loaded, skipping categorical feature analysis.")

# **Explore File Extensions**

The fileextensions column is a string representation of a list. We need to process it to understand the distribution of file types.

In [None]:
if df is not None:
    # Function to safely evaluate the string list and flatten it
    def flatten_extensions(ext_str):
        try:
            # Safely evaluate the string representation of the list
            extensions_list = ast.literal_eval(ext_str)
            # Ensure it's a list of strings, filter out non-string or empty values
            valid_extensions = [ext.strip().replace('.', '') for ext in extensions_list if isinstance(ext, str) and ext.strip()]
            return valid_extensions
        except (ValueError, SyntaxError, TypeError):
            return [] # Return empty list for missing values, malformed strings, or non-iterable results

    # Apply the function, explode the lists into one extension per row,
    # and drop the NaN rows that explode() produces for empty lists
    all_extensions = df['fileextensions'].apply(flatten_extensions).explode().dropna()

    # Get the most common extensions
    if not all_extensions.empty:
        plt.figure(figsize=(12, 8))
        all_extensions_counts = all_extensions.value_counts().head(20) # Top 20
        sns.barplot(x=all_extensions_counts.values, y=all_extensions_counts.index, palette='viridis')
        plt.title('Top 20 Most Frequent File Extensions')
        plt.xlabel('Frequency')
        plt.ylabel('File Extension')
        plt.show()

        print("\nTop 20 most frequent file extensions:")
        print(all_extensions_counts)
    else:
        print("No valid file extensions found after processing.")
else:
    print("Data not loaded, skipping file extensions analysis.")

# **Explore Commit Messages**

**Analyzing commit messages directly can be insightful. We can look at message lengths and word counts.**

In [None]:
if df is not None:
    # Calculate message length and word count
    df['message_length'] = df['commitmessage'].str.len()
    df['word_count'] = df['commitmessage'].str.split().str.len()

    # Plot distributions
    plt.figure(figsize=(15, 6))

    plt.subplot(1, 2, 1)
    sns.histplot(df['message_length'], kde=True, bins=50, color='salmon')
    plt.title('Distribution of Commit Message Lengths')
    plt.xlabel('Message Length')
    plt.ylabel('Frequency')

    plt.subplot(1, 2, 2)
    sns.histplot(df['word_count'], kde=True, bins=50, color='lightgreen')
    plt.title('Distribution of Commit Message Word Counts')
    plt.xlabel('Word Count')
    plt.ylabel('Frequency')

    plt.tight_layout()
    plt.show()

    print("\nSummary statistics for Message Length:")
    print(df['message_length'].describe())

    print("\nSummary statistics for Word Count:")
    print(df['word_count'].describe())
else:
    print("Data not loaded, skipping commit message analysis.")

# **Insights and Observations from Exploratory Analysis**
## **1. Target Distribution**

**Observation:** The dataset exhibits class imbalance.

- Most frequent roles: backend, frontend
- Least frequent role: fullstack

**Implications:**

- Use evaluation metrics that are robust to imbalance (e.g., macro F1 score).
- Consider techniques such as class weighting during model training.
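To make the two implications above concrete, here is a minimal pure-Python sketch. The helper names and the toy label counts are hypothetical and not taken from the dataset. The weight formula `n_samples / (n_classes * class_count)` is the same "balanced" heuristic used by scikit-learn's `compute_class_weight`, and macro F1 is the unweighted mean of per-class F1 scores.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels; the real class counts come from df['role']
y_true = ["backend"] * 6 + ["frontend"] * 3 + ["fullstack"] * 1
y_pred = ["backend"] * 10  # A degenerate model that always predicts the majority class

class_weights = balanced_class_weights(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(class_weights)
print(f"accuracy={accuracy:.2f}, macro_f1={macro_f1(y_true, y_pred):.2f}")
```

In this toy case the majority-class predictor reaches 60% accuracy but a macro F1 of only 0.25, which is why macro F1 is the more honest metric when classes are imbalanced.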

## **2. Numerical Features**

Features analyzed: linesadded, linesdeleted, numfileschanged, numcommentsadded

**Observation:**

- The linesadded and linesdeleted distributions are skewed.
- numfileschanged and numcommentsadded vary noticeably across commits.

**Implications:**

- Skewed features may benefit from outlier-robust scaling (e.g., RobustScaler).
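As an illustration of robust scaling, the sketch below centers each column on its median and divides by its interquartile range, which is what scikit-learn's `RobustScaler` does with its default settings. The `robust_scale` helper and the toy values are assumptions for demonstration only.

```python
import pandas as pd


def robust_scale(frame, columns):
    """Scale columns as (x - median) / IQR, which limits the influence of outliers."""
    subset = frame[columns].astype(float)
    median = subset.median()
    iqr = subset.quantile(0.75) - subset.quantile(0.25)
    # Avoid division by zero for constant columns
    iqr = iqr.replace(0, 1.0)
    return (subset - median) / iqr


# Hypothetical toy values with one extreme commit in each column
demo = pd.DataFrame({
    "linesadded": [1, 2, 3, 4, 1000],
    "linesdeleted": [0, 1, 1, 2, 500],
})

scaled = robust_scale(demo, ["linesadded", "linesdeleted"])
print(scaled)
```

The extreme commit stays large after scaling, but it no longer distorts the scale of the typical commits, which end up in a narrow range around zero.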

## **3. Categorical Feature: Commit Type**

**Observation:**

- Most common commit types: bugfix, feature
- Other commit types are present but infrequent.

**Processing:**

- One-hot encoding of committype for traditional models.
- Extracted features marking the presence of keywords related to commit types.
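The two processing steps above could look roughly like the following sketch. The keyword patterns, the feature names in `KEYWORD_PATTERNS`, and the sample rows are hypothetical; the notebook does not specify which keywords were actually used.

```python
import pandas as pd

# Hypothetical keyword patterns; the real keyword list is not given in this notebook
KEYWORD_PATTERNS = {
    "has_fix_keyword": r"\b(?:fix|bug|patch)\w*",
    "has_feature_keyword": r"\b(?:add|implement|feature)\w*",
}

# Hypothetical sample rows using the dataset's column names
demo = pd.DataFrame({
    "committype": ["bugfix", "feature", "bugfix"],
    "commitmessage": [
        "Fix null pointer in login handler",
        "Add dark mode toggle",
        "Refactor build script",
    ],
})

# One-hot encode the commit type into 0/1 indicator columns
encoded = pd.get_dummies(demo, columns=["committype"], prefix="committype", dtype=int)

# Flag messages that contain any keyword from each group (case-insensitive)
for feature_name, pattern in KEYWORD_PATTERNS.items():
    encoded[feature_name] = (
        demo["commitmessage"]
        .str.contains(pattern, case=False, regex=True, na=False)
        .astype(int)
    )

print(encoded.drop(columns=["commitmessage"]))
```

Passing `na=False` treats missing commit messages as containing no keywords, so the indicator columns never hold NaN.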

## **4. File Extensions**

**Observation:**

- Certain file types are more frequently involved in commits.
- Extensions were mapped to categories like frontend, backend, etc.

**Implications:**

- Categorical features based on file extensions are informative for role prediction.
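A mapping from extensions to categories can be sketched as a plain dictionary lookup. The `EXTENSION_CATEGORIES` table below is an illustrative assumption; the actual mapping used for this dataset is not shown in this notebook.

```python
from collections import Counter

# Hypothetical mapping; the real extension-to-category table is not given here
EXTENSION_CATEGORIES = {
    "js": "frontend", "ts": "frontend", "css": "frontend", "html": "frontend",
    "py": "backend", "java": "backend", "go": "backend", "sql": "backend",
}


def categorize_extensions(extensions):
    """Count how many files in one commit fall into each category."""
    return Counter(
        EXTENSION_CATEGORIES.get(ext.lower().lstrip("."), "other")
        for ext in extensions
    )


# Example: one commit touching four files
category_counts = categorize_extensions(["js", ".css", "py", "md"])
print(category_counts)
```

Per-commit counts like these can then be turned into numeric columns (for example, one column per category) alongside the other features.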

## **5. Commit Messages**

**Observation:**

- The distributions of message length and word count show how verbose commit messages typically are.
- Message content is rich and informative for predicting roles.

**Processing:**

- Extracted keyword-based features.
- Fine-tuned an LLM to capture complex text patterns beyond simple features.

**Implications:**

- Using an LLM is justified, as it can capture nuanced patterns in commit messages.

## **6. Overall Implications**

- The analysis supports the feature engineering choices.
- Handling class imbalance is critical.
- Commit messages are key predictors, reinforcing the use of LLM-based approaches.