### **1. Introduction**
   - **Overview**: An initial questionnaire collects basic information from each user. The application then processes user and course data, normalizing categorical values and encoding interests as numerical vectors. A graph is constructed in which users are nodes and edge weights reflect course similarity, department/faculty match, year proximity, and interest overlap (measured with cosine similarity). Louvain clustering is applied to form groups, with dynamic adjustments to keep group sizes between 2 and 5 members. Leftover users are reassigned based on preference and similarity. NetworkX visualizations illustrate the overall and individual group structures, and the final assignments and common attributes are saved for analysis.
   - **Key Goals**:
     - Create sample user data with attributes such as course, year, interests, etc.
     - Process data to identify similarities and group users.
     - Assign users to groups based on their preferences and interests.
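
The interest-overlap term mentioned in the overview can be illustrated with a minimal sketch. It assumes each user's interests are encoded as a 0/1 vector over the same ordered list of topics. The function name `cosine_similarity`, the zero-vector convention, and the example vectors are illustrative assumptions, not taken from the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: a user with no recorded interests matches nobody
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical interest vectors over [Astronomy, AI, Soccer, Sci-Fi - Dune]
alice = [1, 1, 0, 0]
bob = [0, 1, 1, 0]
print(round(cosine_similarity(alice, bob), 2))  # 0.5
```

Because the vectors are binary, the numerator is simply the number of shared interests, and the denominator normalizes for how many interests each user listed.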

### **2. Data Generation & Preprocessing**

#### **A. Sample User Data Generation**
   - **Description**: Generate a set of synthetic users with attributes such as course, year, interests, and gender.
   - **Key Steps**:
     - Load course data from `courses.csv`.
     - Randomly assign values for user attributes.
     - Save the generated data to `input.csv`.
   - **Output**: Show a preview of the `input.csv` file with sample data (e.g., Name, Course, Year, Interests).

---

#### **B. Merging Course Information**
   - **Description**: Enhance user data with additional course-related details (e.g., department and faculty).
   - **Key Steps**:
     - Load user and course data.
     - Merge them based on the course name.
     - Save the enhanced data to `input_with_department.csv`.
   - **Output**: Show the merged data (e.g., Course, Department, Faculty).

In [None]:
import csv
import pandas as pd
import random

input_file = "ucl_courses.csv"
output_file = "ucl_courses_modified.csv"

with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", newline="", encoding="utf-8") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)

    # Skip the original header row and write the new three-column header
    next(reader, None)
    writer.writerow(["Course Title", "Department", "Faculty"])

    for row in reader:
        course_title = row[0]
        faculty_dept = row[1]  # Format: Faculty | Department

        # Ensure proper splitting
        if " | " in faculty_dept:
            faculty, department = faculty_dept.split(" | ", 1)
        else:
            faculty, department = faculty_dept, "Unknown"

        writer.writerow([course_title, department, faculty])

print(f"Modified CSV saved as {output_file}")

In [2]:
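# Load course data from "courses.csv" (expected columns: Course Title, Department, Faculty)
course_file = "courses.csv"  # Ensure this file exists; the first cell writes "ucl_courses_modified.csv", so copy or rename it if needed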
df_courses = pd.read_csv(course_file)
course_list = df_courses["Course Title"].tolist()

# Predefined Values
years = ["Undergrad year 1", "Undergrad year 2", "Undergrad year 3", "Undergrad year 4", "Masters",
        "Postgrad year 1", "Postgrad year 2", "Postgrad year 3"]
genders = ["Male", "Female", "Non-Binary"]
conversation_starters = ["Which year are you in?", "What course are you doing?", "What did you do over the weekend?"]
society_events = ["Skills development related to your course", "Skills development unrelated to your course",
                  "Volunteering", "Sports", "Culture"]
interests = {
    "Science": ["Astronomy", "Quantum Physics", "AI", "Genetics"],
    "Music": ["Pop - Taylor Swift", "Rock - Queen", "Hip-Hop - Drake", "Jazz - Miles Davis"],
    "Books": ["Fantasy - Harry Potter", "Sci-Fi - Dune", "Mystery - Sherlock Holmes"],
    "Sports": ["Soccer", "Basketball - NBA", "Tennis - Federer"],
    "TV Shows": ["Sci-Fi - Stranger Things", "Fantasy - Game of Thrones", "Crime - Breaking Bad"]
}
preferred_group_sizes = [2, 3, 4, 5]
similar_interest_options = [True, False]

# Generate 50 sample users
data = []
for i in range(50):
    name = f"User{i+1}"
    year = random.choice(years)
    gender = random.choice(genders)
    course = random.choice(course_list)
    all_topics = [topic for topics in interests.values() for topic in topics]  # flatten the categories
    user_interests = random.sample(all_topics, random.randint(1, 3))  # 1-3 random interests
    conversation_starter = random.choice(conversation_starters)
    society_event = random.choice(society_events)
    group_size = random.choice(preferred_group_sizes)
    similar_interest = random.choice(similar_interest_options)

    data.append([name, course, year, gender, ", ".join(user_interests), conversation_starter, 
                 society_event, group_size, similar_interest])

# Create DataFrame
df_users = pd.DataFrame(data, columns=["Name", "Course", "Year", "Gender", "Topics", 
                                       "Conversation Starter", "Interested society events", 
                                       "Preferred Group Size", "Similar Interest"])

# Save to CSV
df_users.to_csv("input.csv", index=False)

print("Sample input.csv file with 50 people has been generated!")

Sample input.csv file with 50 people has been generated!


### **3. Data Transformation**

#### **A. Year Transformation to Numerical Values**
   - **Description**: Transform the "Year" attribute (e.g., "Undergrad year 1") into numerical values for easier processing.
   - **Key Steps**:
     - Map textual year labels to numerical values.
   - **Output**: Show a snippet of data before and after transformation.
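
A minimal sketch of the before/after view, using the same label-to-number mapping as the notebook. The helper `map_years` and the label "Foundation year" are hypothetical and only there to show what happens to a label missing from the mapping (with pandas `Series.map`, such labels become `NaN`).

```python
# Same mapping as in the notebook: "Masters" shares level 4 with "Undergrad year 4"
year_mapping = {
    "Undergrad year 1": 1, "Undergrad year 2": 2, "Undergrad year 3": 3,
    "Undergrad year 4": 4, "Masters": 4,
    "Postgrad year 1": 5, "Postgrad year 2": 6, "Postgrad year 3": 7,
}

def map_years(labels):
    """Return (numeric_years, unmapped_labels) for a list of year labels."""
    numeric = [year_mapping.get(label) for label in labels]
    unmapped = sorted({label for label in labels if label not in year_mapping})
    return numeric, unmapped

before = ["Undergrad year 1", "Masters", "Postgrad year 2", "Foundation year"]
after, unmapped = map_years(before)
for label, value in zip(before, after):
    print(f"{label!r:22} -> {value}")
print("Unmapped labels:", unmapped)
```

Reporting unmapped labels before grouping makes it easier to spot questionnaire values that would otherwise silently turn into missing years.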

---

#### **B. Course Correlation Matrix**
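   - **Description**: Score each pair of users by how closely their courses are related (same course, department, or faculty). The notebook calls this a correlation, although it is a rule-based similarity score rather than a statistical correlation.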
   - **Key Steps**:
     - Compare users' course-related data.
     - Assign correlation scores: High (same course), Medium (same department), Low (same faculty), None (different faculty).
     - Save correlation matrix to `course_correlation_matrix.csv`.
   - **Output**: Display a preview of the correlation matrix (matrix showing similarity scores between users).
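
The scoring rule above can be sketched on a tiny example without pandas. The course, department, and faculty names below are made up for illustration, and `course_score` / `course_matrix` are hypothetical helper names. The 3/2/1/0 scale follows the notebook's own function.

```python
def course_score(u, v):
    """3 = same course, 2 = same department, 1 = same faculty, 0 = otherwise."""
    if u["Course"] == v["Course"]:
        return 3
    if u["Department"] == v["Department"]:
        return 2
    if u["Faculty"] == v["Faculty"]:
        return 1
    return 0

def course_matrix(users):
    """Symmetric matrix of pairwise scores; the diagonal is left at 0."""
    n = len(users)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = course_score(users[i], users[j])
    return matrix

# Illustrative records only; not real course data
users = [
    {"Course": "Physics BSc", "Department": "Physics", "Faculty": "Science"},
    {"Course": "Astrophysics BSc", "Department": "Physics", "Faculty": "Science"},
    {"Course": "Chemistry BSc", "Department": "Chemistry", "Faculty": "Science"},
    {"Course": "History BA", "Department": "History", "Faculty": "Arts"},
]
for row in course_matrix(users):
    print(row)
```

Only the upper triangle is computed and then mirrored, since the score is symmetric.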

---

#### **C. One-Hot Encoding for Interests**
   - **Description**: Convert user interests into a numerical vector using one-hot encoding.
   - **Key Steps**:
     - Flatten the interest categories (e.g., Music, Science, etc.).
     - Apply one-hot encoding to the `Topics` column.
     - Merge these encoded vectors with the main DataFrame.
   - **Output**: Display the updated DataFrame with the one-hot encoded columns for interests.
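
As a rough illustration of the encoding step, the sketch below turns a comma-separated `Topics` string into a 0/1 vector. The notebook itself uses scikit-learn's `MultiLabelBinarizer`; this plain-Python version, the helper name `encode_topics`, and the shortened vocabulary are assumptions made to keep the example small.

```python
def encode_topics(topics_string, vocabulary):
    """Turn 'A, B' into a 0/1 list aligned with the order of `vocabulary`."""
    chosen = {t.strip() for t in topics_string.split(",") if t.strip()}
    return [1 if topic in chosen else 0 for topic in vocabulary]

# Shortened vocabulary for illustration; the notebook flattens all interest categories
vocabulary = ["Astronomy", "AI", "Soccer", "Sci-Fi - Dune"]

print(encode_topics("AI, Soccer", vocabulary))  # [0, 1, 1, 0]
print(encode_topics("", vocabulary))            # [0, 0, 0, 0]
```

Since a user can list several interests, each row may contain more than one 1, which is what makes the dot product of two rows equal to their number of shared interests.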

In [21]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from collections import Counter

# --- Step 1: Load Data ---
user_file = "input.csv"  # Raw user data
course_file = "courses.csv"  # Course info (Course → Department, Faculty)

df_users = pd.read_csv(user_file)
df_courses = pd.read_csv(course_file)

# Ensure course file has correct columns
df_courses.columns = ["Course Title", "Department", "Faculty"]

# --- Step 2: Merge Course Information ---
df_users = df_users.merge(df_courses, left_on="Course", right_on="Course Title", how="left").drop(columns=["Course Title"])

output_path = "input_with_department.csv"
df_users.to_csv(output_path, index=False)

# --- Step 3: Convert Year Labels to Numerical Year Group ---
year_mapping = {
    "Undergrad year 1": 1, "Undergrad year 2": 2, "Undergrad year 3": 3,
    "Undergrad year 4": 4, "Masters": 4,
    "Postgrad year 1": 5, "Postgrad year 2": 6, "Postgrad year 3": 7
}
df_users["Year"] = df_users["Year"].map(year_mapping)

# --- Step 4: Compute Course Correlation Matrix ---
def calculate_course_correlation(df):
    """
    Assigns a correlation score based on shared course, department, or faculty between users.
    High (3) = Same Course
    Medium (2) = Different Course, Same Department
    Low (1) = Different Course, Different Department, Same Faculty
    None (0) = Different Faculty
    """
    correlation_matrix = pd.DataFrame(0, index=df.index, columns=df.index)

    for i, user1 in df.iterrows():
        for j, user2 in df.iterrows():
            if i >= j:
                continue

            if user1["Course"] == user2["Course"]:
                score = 3
            elif user1["Department"] == user2["Department"]:
                score = 2
            elif user1["Faculty"] == user2["Faculty"]:
                score = 1
            else:
                score = 0

            correlation_matrix.loc[i, j] = score
            correlation_matrix.loc[j, i] = score

    return correlation_matrix

# Compute and save course correlation matrix
course_correlation_matrix = calculate_course_correlation(df_users)
course_correlation_matrix.to_csv("course_correlation_matrix.csv")

print("✅ Course correlation matrix saved.")

# --- Step 5: Convert Topics (Interests) into Numerical Vectors ---
interest_categories = {
    "Science": ["Astronomy", "Quantum Physics", "AI", "Genetics"],
    "Music": ["Pop - Taylor Swift", "Rock - Queen", "Hip-Hop - Drake", "Jazz - Miles Davis"],
    "Books": ["Fantasy - Harry Potter", "Sci-Fi - Dune", "Mystery - Sherlock Holmes"],
    "Sports": ["Soccer", "Basketball - NBA", "Tennis - Federer"],
    "TV Shows": ["Sci-Fi - Stranger Things", "Fantasy - Game of Thrones", "Crime - Breaking Bad"]
}

# Flatten interest list
all_interests = [interest for sublist in interest_categories.values() for interest in sublist]

# Convert "Topics" column into lists
df_users["Topics"] = df_users["Topics"].apply(lambda x: x.split(", ") if isinstance(x, str) else [])

# One-Hot Encoding for Interests
mlb = MultiLabelBinarizer(classes=all_interests)
interest_vectors = mlb.fit_transform(df_users["Topics"])
df_interests = pd.DataFrame(interest_vectors, columns=mlb.classes_)

# Merge Interests with Main DataFrame
df_users = pd.concat([df_users, df_interests], axis=1)


✅ Course correlation matrix saved.
✅ Processed data saved to processed_data.csv


### **4. Weights Assignment for Conversation Starters & Events**

#### **A. Assigning Weights to Conversation Starters**
   - **Description**: Label each conversation starter with the user attribute it relates to (course, year, or interests). These "weights" are category labels, not numerical values.
   - **Key Steps**:
     - Define a weight mapping for conversation starters.
   - **Output**: Display a sample of the data with assigned conversation weights.

#### **B. Assigning Weights to Society Events**
   - **Description**: Label each society event (e.g., skills development, volunteering) as course-related or interest-related. As with conversation starters, the "weights" are category labels.
   - **Key Steps**:
     - Define a weight mapping for events.
   - **Output**: Display the updated data with assigned event weights.

In [None]:
# --- Step 6: Assign Conversation Starter Weights ---
conversation_weights = {
    "Which year are you in?": "year",
    "What course are you doing?": "course",
    "What did you do over the weekend?": "interest"
}
df_users["Conversation Weight"] = df_users["Conversation Starter"].map(conversation_weights)

# --- Step 7: Assign Society Event Weights ---
event_weights = {
    "Skills development related to your course": "course",
    "Skills development unrelated to your course": "interest",
    "Volunteering": "interest",
    "Sports": "interest",
    "Culture": "interest"
}
df_users["Event Weight"] = df_users["Interested society events"].fillna("").apply(
    lambda x: [event_weights.get(event.strip(), "unknown") for event in x.split(",") if event.strip()]
)

# --- Step 8: Process Group Size and Similar Interest ---
df_users["Preferred Group Size"] = df_users["Preferred Group Size"].fillna(1).astype(int)
df_users["Similar Interest"] = df_users["Similar Interest"].astype(bool)

# --- Save Processed Data ---
output_path = "processed_data.csv"
df_users.to_csv(output_path, index=False)

print(f"✅ Processed data saved to {output_path}")

### **5. Group Assignment Based on Interests**

#### **A. Group Assignment Algorithm**
   - **Description**: Group users based on their interests and preferred group size. Consider whether users have similar interests (based on user input) to decide if they should be grouped together.
   - **Key Steps**:
     - Iterate over unassigned users.
     - Check if the user’s group size preference is satisfied.
     - Assign groups based on shared interests.
   - **Output**: Display the final grouped data (e.g., User Name, Group ID).
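
The key steps above can be sketched as a small, self-contained function. This is a simplified illustration under stated assumptions: users are plain dicts with hypothetical keys `size`, `similar`, and `vec`, and they are processed in list order so the result is reproducible. It is not the notebook's exact implementation.

```python
def greedy_groups(users):
    """users: list of dicts with 'size' (int), 'similar' (bool), 'vec' (0/1 list).

    Returns a list of groups, each a list of user indices.
    """
    unassigned = list(range(len(users)))  # ordered, so results are reproducible
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed]
        wants_similar = users[seed]["similar"]
        for other in list(unassigned):  # iterate over a copy while removing
            if len(group) >= users[seed]["size"]:
                break  # the seed user's preferred size is reached
            shared = sum(a * b for a, b in zip(users[seed]["vec"], users[other]["vec"]))
            if (wants_similar and shared > 0) or (not wants_similar and shared == 0):
                group.append(other)
                unassigned.remove(other)
        groups.append(group)
    return groups

# Hypothetical users with three possible interests each
users = [
    {"size": 2, "similar": True,  "vec": [1, 0, 0]},
    {"size": 3, "similar": False, "vec": [0, 1, 0]},
    {"size": 2, "similar": True,  "vec": [1, 1, 0]},
    {"size": 2, "similar": True,  "vec": [0, 0, 1]},
]
print(greedy_groups(users))  # [[0, 2], [1, 3]]
```

Note that only the seed user's preferences are consulted: in the example, user 3 asked for similar interests but ends up with user 1, with whom they share none. A group can also end up smaller than the preferred size when no compatible users remain.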

In [23]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Ensure Similar Interest and Preferred Group Size are in correct format
df_users["Similar Interest"] = df_users["Similar Interest"].fillna(False).astype(bool)
df_users["Preferred Group Size"] = df_users["Preferred Group Size"].fillna(1).astype(int)

# Extract the one-hot interest columns by name (all_interests is defined in the encoding cell above);
# selecting by name avoids depending on column positions, which shift once new columns are added
interest_columns = df_users[all_interests]
interest_columns = interest_columns.apply(pd.to_numeric, errors='coerce').fillna(0).astype(int)

# Initialize Group Assignment
df_users["Assigned Group"] = -1  # Placeholder for groups
group_id = 1  # Group counter

# Create a set of unassigned users
unassigned_users = set(df_users.index)

# Grouping Algorithm
while unassigned_users:
    current_user = unassigned_users.pop()
    current_group = [current_user]
    preferred_size = df_users.loc[current_user, "Preferred Group Size"]
    similar_interest = df_users.loc[current_user, "Similar Interest"]

    to_remove = []
    for other_user in unassigned_users:
        if len(current_group) >= preferred_size:
            break  # Stop adding if group size is reached

        # Interest similarity check
        common_interests = np.dot(interest_columns.loc[current_user], interest_columns.loc[other_user])
        if (similar_interest and common_interests > 0) or (not similar_interest and common_interests == 0):
            current_group.append(other_user)
            to_remove.append(other_user)

    # Remove assigned users from the pool
    for user in to_remove:
        unassigned_users.remove(user)

    # Assign group ID
    df_users.loc[current_group, "Assigned Group"] = group_id
    group_id += 1

# Save the updated grouped data
output_grouped_path = "grouped_users.csv"
df_users.to_csv(output_grouped_path, index=False)

print(f"Grouped data saved to {output_grouped_path}")

Grouped data saved to grouped_users.csv


### **6. Identifying Common Group Attributes**

#### **A. Common Group Attributes Analysis**
   - **Description**: Identify common attributes (e.g., common course, year, interests) within each group.
   - **Key Steps**:
     - For each group, find the most common year, course, department, and interest.
   - **Output**: Show common attributes for each group (e.g., most common course or interest).
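
A minimal sketch of the per-group summary described above, using `collections.Counter`. The helper names `most_common_value` and `summarize_group`, the dict-based records, and the sample values are illustrative assumptions. With the notebook's DataFrame, the same idea could be applied to each group from `groupby("Assigned Group")`.

```python
from collections import Counter

def most_common_value(values):
    """Most frequent non-empty value, or None if there is none."""
    counts = Counter(v for v in values if v not in (None, ""))
    return counts.most_common(1)[0][0] if counts else None

def summarize_group(members):
    """members: list of dicts with Year, Course, Department and Topics (a list)."""
    return {
        "Year": most_common_value(m["Year"] for m in members),
        "Course": most_common_value(m["Course"] for m in members),
        "Department": most_common_value(m["Department"] for m in members),
        "Interest": most_common_value(t for m in members for t in m["Topics"]),
    }

# Illustrative group; not taken from the generated data
group = [
    {"Year": 1, "Course": "Physics BSc", "Department": "Physics", "Topics": ["AI", "Soccer"]},
    {"Year": 1, "Course": "Physics BSc", "Department": "Physics", "Topics": ["AI"]},
    {"Year": 2, "Course": "Chemistry BSc", "Department": "Chemistry", "Topics": ["Astronomy"]},
]
print(summarize_group(group))
```

When several values are tied, `Counter.most_common` returns the one encountered first, so the result depends on member order in that case.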

In [25]:
# --- Display Users by Group ---
grouped_users = df_users.groupby("Assigned Group")

for group_id, group_data in grouped_users:
    print(f"\n===== Group {group_id} =====")
    print(group_data[["Name", "Course", "Preferred Group Size", "Similar Interest"]])


===== Group 1 =====
    Name                   Course  Preferred Group Size  Similar Interest
0  User1       Earth Sciences BSc                     2              True
6  User7  French and Hungarian BA                     4             False

===== Group 2 =====
    Name                    Course  Preferred Group Size  Similar Interest
1  User2     Finnish and French BA                     4             False
3  User4  Norwegian and Russian BA                     2              True
4  User5   Cancer Biomedicine MSci                     4             False
8  User9              Icelandic BA                     2              True

===== Group 3 =====
      Name                                 Course  Preferred Group Size  \
2    User3  Ancient Languages with Year Abroad BA                     5   
5    User6                French and Norwegian BA                     5   
9   User10                 Arts and Sciences BASc                     3   


### **7. Save Course List**

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time

# URL of the UCL undergraduate degrees page
url = 'https://www.ucl.ac.uk/prospective-students/undergraduate/degrees'

# Send request to fetch the page; raise an exception on HTTP error responses
# instead of calling exit(), which would stop the notebook kernel
response = requests.get(url, timeout=30)
response.raise_for_status()

# Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find all course listings
courses = soup.find_all('div', class_='result-item clearfix')

# Extract course details
data = []
for course in courses:
    title_tag = course.find('a')
    title = title_tag.get_text(strip=True) if title_tag else 'N/A'
    # .get avoids a KeyError if the anchor has no href attribute
    link = title_tag.get('href', 'N/A') if title_tag else 'N/A'
    faculty_dept_tags = course.find_all('span', class_='search-results__dept')
    faculty_dept = faculty_dept_tags[0].get_text(strip=True) if faculty_dept_tags else 'N/A'

    # No delay is needed here: the loop only parses the page already downloaded
    data.append([title, faculty_dept, link])

# Save data to CSV
with open('ucl_courses.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Course Title', 'Faculty and Department', 'Link'])
    writer.writerows(data)

print("Scraping completed. Data saved to ucl_courses.csv")