# The Leetcode Problem Dataset
Here, I will explore and visualize this dataset to gain insights and identify patterns. The dataset contains information about LeetCode problems, including their difficulty levels, tags, and other relevant attributes. By visualizing this data, we can better understand the distribution of problem difficulties, the most common tags, and other interesting trends.

In [31]:
import os
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

In [61]:
csv_file = '/datasets/leetcode_dataset.csv'
csv_full_path = os.getcwd()+csv_file
if os.path.exists(csv_full_path):
    leetcode_problems = pd.read_csv(csv_full_path)
else:
    leetcode_problems = None
    print("File not found.")

In [62]:
selected_columns = ["id", "title", "description", "difficulty", "acceptance_rate", "frequency", "discuss_count", "likes", "dislikes", "rating"]

In [63]:
leetcode_problems = leetcode_problems.filter(items=selected_columns)

In [64]:
leetcode_problems.head()

Unnamed: 0,id,title,description,difficulty,acceptance_rate,frequency,discuss_count,likes,dislikes,rating
0,1,Two Sum,Given an array of integers `nums` and an integ...,Easy,46.7,100.0,999,20217,712,97
1,2,Add Two Numbers,You are given two non-empty linked lists repre...,Medium,35.7,93.1,999,11350,2704,81
2,3,Longest Substring Without Repeating Characters,"Given a string `s`, find the length of the lon...",Medium,31.5,90.9,999,13810,714,95
3,4,Median of Two Sorted Arrays,Given two sorted arrays `nums1` and `nums2` of...,Hard,31.4,86.2,999,9665,1486,87
4,5,Longest Palindromic Substring,"Given a string `s`, return the longest palindr...",Medium,30.6,84.7,999,10271,670,94


In [14]:
leetcode_problems.shape

(1825, 10)

In [15]:
difficulty_counts = leetcode_problems.groupby('difficulty')
difficulty_counts.size()

difficulty
Easy      477
Hard      385
Medium    963
dtype: int64

## Selection of the Problems
To choose the problems I want to include in my research, I plan to analyze how certain factors correlate with one another. Specifically, I will look at the relationship between the rating (calculated as likes / (likes + dislikes)) and the frequency of each problem in comparison to its acceptance rate, difficulty level, and the number of discussion posts.

The correlation coefficient is between -1 and 1, where:
- 1 indicates a perfect positive correlation,
- 1 indicates a perfect negative correlation,
- 0 indicates no correlation.

In [38]:
def compute_correlation(feature_1, feature_2):
    return leetcode_problems[feature_1].corr(leetcode_problems[feature_2], method='pearson')

def compute_mutual_information(feature_1, feature_2):
    return mutual_info_classif(leetcode_problems[feature_1].to_frame(), leetcode_problems[feature_2], discrete_features=False)

In [39]:
def correlation_analysis():
    print("Correlation/Mutual Information between user preferences and acceptance rate, number of disscusion posts, and difficulty level.")
    print(f"Rating & Acceptance Rate: {compute_correlation('rating', 'acceptance_rate')}")
    print(f"Rating & Discuss Count: {compute_correlation('rating', 'discuss_count')}")
    print(f"Rating & Difficulty Level: {compute_mutual_information('rating', 'difficulty')}")

    print(f"Likes & Acceptance Rate: {compute_correlation('likes', 'acceptance_rate')}")
    print(f"Likes & Discuss Count: {compute_correlation('likes', 'discuss_count')}")
    print(f"Likes & Difficulty Level: {compute_mutual_information('rating', 'difficulty')}")
    
    print(f"Dislikes & Acceptance Rate: {compute_correlation('dislikes', 'acceptance_rate')}")
    print(f"Dislikes & Discuss Count: {compute_correlation('dislikes', 'discuss_count')}")
    print(f"Dislikes & Difficulty Level: {compute_mutual_information('rating', 'difficulty')}")
    
    print("\n")
    print("Correlation/Mutual Information between problems' frequency and acceptance rate, number of disscusion posts, and difficulty level.")
    print(f"Frequency & Acceptance Rate: {compute_correlation('frequency', 'acceptance_rate')}")
    print(f"Frequency & Discuss Count: {compute_correlation('frequency', 'discuss_count')}")
    print(f"Frequncy & Difficulty Level: {compute_mutual_information('frequency', 'difficulty')}")


In [40]:
correlation_analysis()

Correlation/Mutual Information between user preferences and acceptance rate, number of disscusion posts, and difficulty level.
Rating & Acceptance Rate: 0.10283248334628776
Rating & Discuss Count: 0.14597714752434862
Rating & Difficulty Level: [0.02235389]
Likes & Acceptance Rate: -0.16477436888885752
Likes & Discuss Count: 0.6683382551951363
Likes & Difficulty Level: [0.01119193]
Dislikes & Acceptance Rate: -0.1792639970313835
Dislikes & Discuss Count: 0.30764244244601957
Dislikes & Difficulty Level: [0.01853596]


Correlation/Mutual Information between problems' frequency and acceptance rate, number of disscusion posts, and difficulty level.
Frequency & Acceptance Rate: -0.23185275093821459
Frequency & Discuss Count: 0.519824075078056
Frequncy & Difficulty Level: [0.02578374]


The analysis shows that user engagement, particularly through likes and discussion posts, is more closely related to problem popularity than acceptance rate or difficulty level. Problems with more discussions tend to receive higher ratings and more likes, while problems with higher acceptance rates tend to receive fewer likes and slightly more dislikes. Interestingly, the difficulty level has minimal influence on user preferences or problem frequency. Additionally, problems that appear more frequently are more discussed but tend to have lower acceptance rates. Overall, user preferences are driven more by how often a problem sparks discussion rather than how easy or hard it is.

### Getting k Most Popular Problems
I calculate the popularity of a problem using its rating score.

In [41]:
def get_top_k_problems(group, k):
    return group.sort_values(by='rating', ascending=False).head(k)

In [48]:
def get_k_most_popular_problems(k, difficulty):
    difficulty_group = leetcode_problems[(leetcode_problems['difficulty'] == difficulty) & (leetcode_problems['description'] != 'SQL Schema')]
    top_k_problems_by_category = difficulty_group.groupby('difficulty', group_keys=False).apply(get_top_k_problems, k=k)
    return top_k_problems_by_category[["id", "title", "description", "difficulty", "rating"]]

In [57]:
selected_problems = pd.concat([get_k_most_popular_problems(10, 'Easy'), get_k_most_popular_problems(10, 'Medium'), get_k_most_popular_problems(10, 'Hard')], ignore_index=True)
selected_problems = selected_problems[["title", "description", "difficulty"]]

  top_k_problems_by_category = difficulty_group.groupby('difficulty', group_keys=False).apply(get_top_k_problems, k=k)
  top_k_problems_by_category = difficulty_group.groupby('difficulty', group_keys=False).apply(get_top_k_problems, k=k)
  top_k_problems_by_category = difficulty_group.groupby('difficulty', group_keys=False).apply(get_top_k_problems, k=k)


In [58]:
selected_problems.shape

(30, 3)

In [59]:
print(selected_problems['description'][0])

You are given an integer array `score` of size `n`, where `score[i]` is the score of the `ith` athlete in a competition. All the scores are guaranteed to be unique.

The athletes are placed based on their scores, where the `1st` place athlete has the highest score, the `2nd` place athlete has the `2nd` highest score, and so on. The placement of each athlete determines their rank:
The `1st` place athlete's rank is `"Gold Medal"`.

The `2nd` place athlete's rank is `"Silver Medal"`.

The `3rd` place athlete's rank is `"Bronze Medal"`.

For the `4th` place to the `nth` place athlete, their rank is their placement number (i.e., the `xth` place athlete's rank is `"x"`).

Return an array `answer` of size `n` where `answer[i]` is the rank of the `ith` athlete.


Example 1:
Input: score = [5,4,3,2,1]
Output: ["Gold Medal","Silver Medal","Bronze Medal","4","5"]
Explanation: The placements are [1st, 2nd, 3rd, 4th, 5th].


Example 2:
Input: score = [10,3,8,9,4]
Output: ["Gold Medal","5","Bronze M

In [60]:
selected_problems.to_csv('datasets/leetcode_selected_problems.csv', index=False)