# Leetcode Questions Dataset

## Problem Statement  
Leetcode is a widely used platform for coding practice and technical interview preparation. The dataset contains information on Leetcode problems, including their difficulty, topics, and other metadata. Analyzing this dataset can provide insights into problem distribution, topic trends, and difficulty levels, aiding learners in optimizing their preparation strategy.

## Objective  
- To analyze the Leetcode dataset to identify patterns in coding problems.  
- To categorize problems based on topics, difficulty, and acceptance rate.  
- To assist users in planning their coding practice by understanding the distribution of questions.

## Data Description  

| Column Name        | Description                                                        |
|--------------------|--------------------------------------------------------------------|
| `Question_No`      | Unique identifier for each problem                                 |
| `Question`         | Name of the Leetcode question                                      |
| `isPremium`        | Whether the question is premium-only (`True`, `False`)             |
| `Difficulty`       | Difficulty level (`Easy`, `Medium`, `Hard`)                        |
| `Acceptance_rate`  | Percentage of successful submissions (e.g., `54.10%`)              |
| `Topic_tags`       | Topics associated with the problem (e.g., `Array`, `Hash Table`)   |
| `submission_count` | Total number of submissions made for the problem                   |
| `total_accepted`   | Number of successful submissions                                   |
| `Question_Link`    | Link to the problem on Leetcode                                    |
| `Solution`         | Link to the problem solution on Leetcode                           |



In [166]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt

In [167]:
data=pd.read_csv('Leetcode_Questions_updated.csv')
data.head()

Unnamed: 0,Question_No,Question,Topic_tags,Acceptance_rate,isPremium,Difficulty,Question_Link,Solution
0,1,Two Sum,"['Array', 'Hash Table']",54.10%,False,Easy,https://leetcode.com/problems/two-sum/description,https://leetcode.com/problems/two-sum/solutions
1,2,Add Two Numbers,"['Linked List', 'Math', 'Recursion']",44.50%,False,Medium,https://leetcode.com/problems/add-two-numbers/...,https://leetcode.com/problems/add-two-numbers/...
2,3,Longest Substring Without Repeating Characters,"['Hash Table', 'String', 'Sliding Window']",35.70%,False,Medium,https://leetcode.com/problems/longest-substrin...,https://leetcode.com/problems/longest-substrin...
3,4,Median of Two Sorted Arrays,"['Array', 'Binary Search', '1+']",41.80%,False,Hard,https://leetcode.com/problems/median-of-two-so...,https://leetcode.com/problems/median-of-two-so...
4,5,Longest Palindromic Substring,"['Two Pointers', 'String', '1+']",34.70%,False,Medium,https://leetcode.com/problems/longest-palindro...,https://leetcode.com/problems/longest-palindro...


In [168]:
# Use the question number as the index (this also removes it from the columns)
data = data.set_index('Question_No')
data.head()

Unnamed: 0_level_0,Question,Topic_tags,Acceptance_rate,isPremium,Difficulty,Question_Link,Solution
Question_No,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Two Sum,"['Array', 'Hash Table']",54.10%,False,Easy,https://leetcode.com/problems/two-sum/description,https://leetcode.com/problems/two-sum/solutions
2,Add Two Numbers,"['Linked List', 'Math', 'Recursion']",44.50%,False,Medium,https://leetcode.com/problems/add-two-numbers/...,https://leetcode.com/problems/add-two-numbers/...
3,Longest Substring Without Repeating Characters,"['Hash Table', 'String', 'Sliding Window']",35.70%,False,Medium,https://leetcode.com/problems/longest-substrin...,https://leetcode.com/problems/longest-substrin...
4,Median of Two Sorted Arrays,"['Array', 'Binary Search', '1+']",41.80%,False,Hard,https://leetcode.com/problems/median-of-two-so...,https://leetcode.com/problems/median-of-two-so...
5,Longest Palindromic Substring,"['Two Pointers', 'String', '1+']",34.70%,False,Medium,https://leetcode.com/problems/longest-palindro...,https://leetcode.com/problems/longest-palindro...


In [169]:
data.drop(columns=['Question_Link','Solution'],inplace=True)
data.head()

Unnamed: 0_level_0,Question,Topic_tags,Acceptance_rate,isPremium,Difficulty
Question_No,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Two Sum,"['Array', 'Hash Table']",54.10%,False,Easy
2,Add Two Numbers,"['Linked List', 'Math', 'Recursion']",44.50%,False,Medium
3,Longest Substring Without Repeating Characters,"['Hash Table', 'String', 'Sliding Window']",35.70%,False,Medium
4,Median of Two Sorted Arrays,"['Array', 'Binary Search', '1+']",41.80%,False,Hard
5,Longest Palindromic Substring,"['Two Pointers', 'String', '1+']",34.70%,False,Medium


## Descriptive Analysis

In [170]:
data.shape

(3306, 5)

In [171]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3306 entries, 1 to 3339
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Question         3306 non-null   object
 1   Topic_tags       3306 non-null   object
 2   Acceptance_rate  3306 non-null   object
 3   isPremium        3306 non-null   bool  
 4   Difficulty       3306 non-null   object
dtypes: bool(1), object(4)
memory usage: 132.4+ KB


In [172]:
data.describe().T

Unnamed: 0,count,unique,top,freq
Question,3306,3306,Two Sum,1
Topic_tags,3306,759,['Database'],284
Acceptance_rate,3306,706,51.60%,14
isPremium,3306,2,False,2630
Difficulty,3306,3,Medium,1733


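**Inferences :**
- The most frequent tag combination is `['Database']` on its own, which appears on 284 of the 3306 questions.
- Acceptance rates are spread over 706 distinct values; the most common one, `51.60%`, occurs only 14 times, so no single rate dominates.
- Most questions are free: 2630 of the 3306 have `isPremium = False`.
- `Medium` is the most common difficulty level, with 1733 of the 3306 questions.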

In [173]:
# Convert the acceptance rate from a percent string to a float
# e.g. '51.3%' -> 51.3
data['Acceptance_rate'] = data['Acceptance_rate'].str.rstrip('%').astype(float)
data.head()

Unnamed: 0_level_0,Question,Topic_tags,Acceptance_rate,isPremium,Difficulty
Question_No,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Two Sum,"['Array', 'Hash Table']",54.1,False,Easy
2,Add Two Numbers,"['Linked List', 'Math', 'Recursion']",44.5,False,Medium
3,Longest Substring Without Repeating Characters,"['Hash Table', 'String', 'Sliding Window']",35.7,False,Medium
4,Median of Two Sorted Arrays,"['Array', 'Binary Search', '1+']",41.8,False,Hard
5,Longest Palindromic Substring,"['Two Pointers', 'String', '1+']",34.7,False,Medium
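
The conversion above assumes every value is a well-formed string ending in `%`. A slightly more defensive variant is sketched below on a small made-up Series (the sample values and the names `raw` and `rate` are illustrative, not taken from the dataset). It strips whitespace and the percent sign with vectorized string methods and turns anything unparseable into `NaN` instead of raising:

```python
import pandas as pd

# Illustrative values only; the real column comes from the CSV.
raw = pd.Series(["54.10%", "44.50%", " 35.70% ", "n/a"])

# Strip surrounding whitespace and a trailing '%', then coerce to float.
# errors="coerce" maps unparseable entries to NaN rather than raising.
rate = pd.to_numeric(raw.str.strip().str.rstrip("%"), errors="coerce")

print(rate.tolist())
```

Counting `rate.isna().sum()` afterwards is a quick way to see how many rows failed to parse.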


In [174]:
data['Acceptance_rate'].describe()

count    3306.000000
mean       56.397520
std        16.343856
min        10.400000
25%        44.300000
50%        56.300000
75%        68.100000
max        96.100000
Name: Acceptance_rate, dtype: float64

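**Inference :**
- There are `no null` values in the data: every column has 3306 non-null entries.
- The lowest acceptance rate is `10.4%` and the highest is `96.1%`.
- The mean acceptance rate is `56.40%` with a standard deviation of `16.34` percentage points; the median (`56.3%`) is almost identical to the mean.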

In [175]:
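# H0: the true mean acceptance rate equals 56.39%
# Ha: the true mean acceptance rate differs from 56.39%
# Note: 56.39 is this sample's own mean (truncated), so a t statistic near 0
# and a p-value near 1 are expected; the test adds no new information.
stats.ttest_1samp(data['Acceptance_rate'], popmean=56.39)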

TtestResult(statistic=0.02645423264580009, pvalue=0.9788966349938212, df=3305)
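
The one-sample t statistic is `(sample mean - hypothesised mean) / (sample std / sqrt(n))`. As a rough check, the sketch below recomputes it from the summary statistics printed by `describe()` above rather than from the raw column; the variable names are illustrative. It should land close to the `TtestResult` shown, and it makes clear why the statistic is nearly zero: the hypothesised mean is the sample mean itself.

```python
import math
from scipy import stats

# Summary statistics copied from the describe() output above
n, sample_mean, sample_std = 3306, 56.397520, 16.343856
mu0 = 56.39  # hypothesised mean used in the test above

# Standard error of the mean, then the t statistic
standard_error = sample_std / math.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
```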

In [176]:
import ast
data['Topic_tags'] = data['Topic_tags'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

In [177]:
data.head()

Unnamed: 0_level_0,Question,Topic_tags,Acceptance_rate,isPremium,Difficulty
Question_No,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Two Sum,"[Array, Hash Table]",54.1,False,Easy
2,Add Two Numbers,"[Linked List, Math, Recursion]",44.5,False,Medium
3,Longest Substring Without Repeating Characters,"[Hash Table, String, Sliding Window]",35.7,False,Medium
4,Median of Two Sorted Arrays,"[Array, Binary Search, 1+]",41.8,False,Hard
5,Longest Palindromic Substring,"[Two Pointers, String, 1+]",34.7,False,Medium
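
`ast.literal_eval` raises on malformed input, so a single bad cell would stop the conversion above. A minimal sketch of a more forgiving parser follows; `parse_tags` is a hypothetical helper, and returning an empty list for bad cells is an assumption about what is acceptable for this analysis:

```python
import ast

def parse_tags(cell):
    """Parse a stringified list such as "['Array', 'Hash Table']" into a list."""
    if isinstance(cell, list):
        return cell  # already parsed
    if not isinstance(cell, str):
        return []    # e.g. NaN or None
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []    # malformed literal
    return parsed if isinstance(parsed, list) else []

print(parse_tags("['Array', 'Hash Table']"))
print(parse_tags("['Array'"))
```

It could be applied with `data['Topic_tags'].apply(parse_tags)` in place of the lambda.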


In [178]:
# Collect the sorted list of distinct tags across all questions
tags = sorted({tag for tag_list in data['Topic_tags'] for tag in tag_list})
tags

['1+',
 '2+',
 '3+',
 '4+',
 '5+',
 '6+',
 '7+',
 'Array',
 'Backtracking',
 'Binary Indexed Tree',
 'Binary Search',
 'Binary Search Tree',
 'Binary Tree',
 'Bit Manipulation',
 'Brainteaser',
 'Breadth-First Search',
 'Bucket Sort',
 'Combinatorics',
 'Concurrency',
 'Counting',
 'Counting Sort',
 'Data Stream',
 'Database',
 'Depth-First Search',
 'Design',
 'Divide and Conquer',
 'Doubly-Linked List',
 'Dynamic Programming',
 'Enumeration',
 'Game Theory',
 'Geometry',
 'Graph',
 'Greedy',
 'Hash Function',
 'Hash Table',
 'Heap (Priority Queue)',
 'Interactive',
 'Iterator',
 'Line Sweep',
 'Linked List',
 'Math',
 'Matrix',
 'Memoization',
 'Monotonic Stack',
 'Number Theory',
 'Ordered Set',
 'Prefix Sum',
 'Probability and Statistics',
 'Queue',
 'Randomized',
 'Recursion',
 'Rejection Sampling',
 'Rolling Hash',
 'Segment Tree',
 'Shell',
 'Shortest Path',
 'Simulation',
 'Sliding Window',
 'Sorting',
 'Stack',
 'String',
 'String Matching',
 'Topological Sort',
 'Tree',
 'Trie',
 'Two Pointers',
 'Union Find']
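
Beyond listing the distinct tags, it is often useful to know how many questions carry each one. A small sketch using `Series.explode` on a made-up frame (`demo` and its three rows are illustrative, shaped like the first rows of the dataset):

```python
import pandas as pd

# Illustrative rows; each cell is already a Python list of tags.
demo = pd.DataFrame(
    {"Topic_tags": [
        ["Array", "Hash Table"],
        ["Linked List", "Math", "Recursion"],
        ["Array", "Binary Search", "1+"],
    ]}
)

# explode() gives one row per (question, tag); value_counts() tallies them.
tag_counts = demo["Topic_tags"].explode().value_counts()
unique_tags = sorted(tag_counts.index)

print(tag_counts)
print(unique_tags)
```

Unlike the `describe()` output earlier, which counts whole tag combinations, this counts each tag individually.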

In [179]:
from sklearn.preprocessing import MultiLabelBinarizer

# Topic_tags already holds Python lists (parsed in In [176]),
# so it can be one-hot encoded directly: one 0/1 column per tag.
mlb = MultiLabelBinarizer()
topic_encoded = pd.DataFrame(
    mlb.fit_transform(data['Topic_tags']),
    columns=mlb.classes_,  # names come from the encoder, so they always match its column order
    index=data.index,
)
data_final = data.join(topic_encoded)
data_final.head()

Unnamed: 0_level_0,Question,Topic_tags,Acceptance_rate,isPremium,Difficulty,1+,2+,3+,4+,5+,...,Sliding Window,Sorting,Stack,String,String Matching,Topological Sort,Tree,Trie,Two Pointers,Union Find
Question_No,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,Two Sum,"[Array, Hash Table]",54.1,False,Easy,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Add Two Numbers,"[Linked List, Math, Recursion]",44.5,False,Medium,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Longest Substring Without Repeating Characters,"[Hash Table, String, Sliding Window]",35.7,False,Medium,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
4,Median of Two Sorted Arrays,"[Array, Binary Search, 1+]",41.8,False,Hard,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Longest Palindromic Substring,"[Two Pointers, String, 1+]",34.7,False,Medium,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
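
The column labels of the encoded matrix must follow the order in which `MultiLabelBinarizer` emits its columns, which is stored in `mlb.classes_`. The sketch below shows this on a tiny illustrative frame (`demo` is made up); reading the labels from the fitted encoder avoids depending on a separately built tag list being in the same order:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative data indexed like the notebook's frame.
demo = pd.DataFrame(
    {"Topic_tags": [["Array", "Hash Table"], ["String", "Array"], ["Math"]]},
    index=pd.Index([1, 2, 3], name="Question_No"),
)

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(demo["Topic_tags"]),
    columns=mlb.classes_,  # column order as produced by the encoder
    index=demo.index,
)

print(encoded)
```

Each row sum equals the number of tags on that question, which is a cheap sanity check on the encoding.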


In [180]:
data_final.shape

(3306, 72)

## Using `data_final` for the Analysis

In [181]:
import statsmodels.api as sma
from sklearn.preprocessing import LabelEncoder

In [182]:
x=data_final.drop(columns=['Question','Topic_tags','Acceptance_rate'])
y=data_final['Acceptance_rate']

In [183]:
encod=LabelEncoder()
data_final['isPremium']=encod.fit_transform(data_final['isPremium'])
x['isPremium']=encod.fit_transform(x['isPremium'])

In [184]:
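# Note: LabelEncoder assigns codes alphabetically (Easy=0, Hard=1, Medium=2),
# so the encoded Difficulty is a nominal label, not an ordered scale.
encod=LabelEncoder()
data_final['Difficulty']=encod.fit_transform(data_final['Difficulty'])
x['Difficulty']=encod.fit_transform(x['Difficulty'])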

In [185]:
x.head()

Unnamed: 0_level_0,isPremium,Difficulty,1+,2+,3+,4+,5+,6+,7+,Array,...,Sliding Window,Sorting,Stack,String,String Matching,Topological Sort,Tree,Trie,Two Pointers,Union Find
Question_No,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,2,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
4,0,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,0,2,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
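
In the output above `Easy` is coded 0, `Hard` 1 and `Medium` 2, because the codes follow alphabetical order. If an ordered scale is wanted, one option is an explicit mapping. The sketch below is an alternative under that assumption, not what the notebook ran; the `order` dictionary and the sample values are illustrative:

```python
import pandas as pd

# Difficulty of the first five questions shown above
difficulty = pd.Series(["Easy", "Medium", "Medium", "Hard", "Medium"])

# Explicit ordinal mapping: larger code means harder question
order = {"Easy": 0, "Medium": 1, "Hard": 2}
ordinal = difficulty.map(order)

# Alphabetical codes, equivalent to what LabelEncoder produced
alphabetical = difficulty.astype("category").cat.codes

print(pd.DataFrame({"difficulty": difficulty,
                    "ordinal": ordinal,
                    "alphabetical": alphabetical}))
```

Another common choice is one-hot encoding with `pd.get_dummies`, which makes no assumption that the step from `Easy` to `Medium` equals the step from `Medium` to `Hard`.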


In [188]:
x.shape,y.shape

((3306, 69), (3306,))

In [191]:
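# No constant column is added to x, so this model is fit without an intercept
# and the summary reports the uncentered R-squared.
model=sma.OLS(y,x).fit()
model.summary()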

0,1,2,3
Dep. Variable:,Acceptance_rate,R-squared (uncentered):,0.848
Model:,OLS,Adj. R-squared (uncentered):,0.845
Method:,Least Squares,F-statistic:,261.4
Date:,"Sun, 23 Feb 2025",Prob (F-statistic):,0.0
Time:,11:35:40,Log-Likelihood:,-15043.0
No. Observations:,3306,AIC:,30220.0
Df Residuals:,3237,BIC:,30650.0
Df Model:,69,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
isPremium,9.9060,1.117,8.867,0.000,7.715,12.097
Difficulty,1.1205,0.508,2.207,0.027,0.125,2.116
1+,-6.6678,1.366,-4.882,0.000,-9.346,-3.990
2+,-2.8491,1.449,-1.966,0.049,-5.691,-0.007
3+,-0.0329,1.762,-0.019,0.985,-3.487,3.421
4+,-2.8882,3.004,-0.961,0.336,-8.778,3.002
5+,-6.6173,4.621,-1.432,0.152,-15.678,2.443
6+,-2.2622,7.870,-0.287,0.774,-17.693,13.169
7+,-15.8017,23.632,-0.669,0.504,-62.136,30.533

0,1,2,3
Omnibus:,354.157,Durbin-Watson:,1.43
Prob(Omnibus):,0.0,Jarque-Bera (JB):,579.386
Skew:,0.756,Prob(JB):,1.54e-126
Kurtosis:,4.386,Cond. No.,101.0
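
The summary reports an *uncentered* R-squared because the model has no intercept. The sketch below, on synthetic data with NumPy only, illustrates why that number can be large even when a predictor carries no information: the uncentered version measures variation around zero, not around the mean. The data, the seed and the helper `residuals` are all illustrative assumptions, not a re-run of the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# A 0/1 predictor that is unrelated to y, and a response centred
# near 56 with spread 16, loosely mimicking acceptance rates.
x = rng.integers(0, 2, size=n).astype(float)
y = 56.0 + rng.normal(0.0, 16.0, size=n)

def residuals(X, y):
    """Least-squares residuals of y regressed on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Fit without an intercept: R-squared is measured against sum(y**2).
r0 = residuals(x[:, None], y)
r2_uncentered = 1 - (r0 @ r0) / (y @ y)

# Fit with an intercept: R-squared is measured against variation around the mean.
X1 = np.column_stack([np.ones(n), x])
r1 = residuals(X1, y)
centered = y - y.mean()
r2_centered = 1 - (r1 @ r1) / (centered @ centered)

print(f"uncentered R^2 (no intercept): {r2_uncentered:.3f}")
print(f"centered R^2 (with intercept): {r2_centered:.3f}")
```

In statsmodels, an intercept is usually added with `sma.add_constant(x)` before calling `sma.OLS`, after which the summary reports the ordinary centered R-squared.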


### **Inferences from the OLS Regression Results:**

- **Premium-only questions (`isPremium`) have a significantly higher acceptance rate** than free questions, by about 9.9 percentage points with the other regressors held fixed (**coefficient 9.906, p=0.000**). Note that `isPremium` describes the question, not the user.  
- **The `Difficulty` coefficient (1.120, p=0.027) cannot be read as "harder questions are accepted more often"**: `LabelEncoder` assigned alphabetical codes (`Easy`=0, `Hard`=1, `Medium`=2), so the variable is not ordered by difficulty.  
- **Several topics have large, significant positive coefficients**:  
  - `Database (55.12, p=0.000)`, `Dynamic Programming (21.95, p=0.000)`, `Graph (22.39, p=0.000)`, `String (26.63, p=0.000)`, and `Tree (27.29, p=0.000)`. Because the model has no intercept, these coefficients absorb the baseline acceptance rate and are better read as levels than as increases.  
- **The `N+` columns have negative coefficients**: `1+ (-6.66, p=0.000)` and `2+ (-2.84, p=0.049)` are significant at the 5% level, while `3+` through `7+` are not (for example `5+ (-6.61, p=0.152)`). These columns appear to be "more tags" overflow markers rather than real topics, so they probably flag questions that carry many tags.  
- **Algorithmic topics (e.g., Backtracking, Bit Manipulation, Greedy, Sorting) also carry positive coefficients**; without an intercept this does not by itself show that they raise acceptance rates.  
- **Some topics have an insignificant effect on acceptance rates** (e.g., `Bucket Sort (p=0.733)`, `Game Theory (p=0.527)`, `Shortest Path (p=0.558)`), so the data give no evidence that they shift acceptance rates.  
- **Most questions are not premium-only** (2630 of 3306). This comes from the descriptive statistics, not from the regression.  
- **`Medium` is the most common difficulty level** (1733 of 3306 questions). This also comes from the descriptive statistics; the data contain no attempt counts.  
- **`Math`, `Graph`, and `Dynamic Programming` have positive coefficients**, subject to the same no-intercept caveat as the other topics.  
- **The regressors are jointly significant** (**F-statistic 261.4, p=0.000**), but the reported **R² (0.848)** and **Adjusted R² (0.845)** are *uncentered*: they measure variation around zero rather than around the mean, so they overstate how much of the variability in acceptance rates the model explains.
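
To separate significant from insignificant terms consistently, the coefficients can be filtered by p-value. The sketch below does this for the nine rows visible in the coefficient table above, typed in by hand; the frame name `coef_table` and the 5% threshold `alpha` are illustrative choices:

```python
import pandas as pd

# Coefficients and p-values copied from the regression table above
coef_table = pd.DataFrame(
    {
        "coef": [9.9060, 1.1205, -6.6678, -2.8491, -0.0329,
                 -2.8882, -6.6173, -2.2622, -15.8017],
        "p_value": [0.000, 0.027, 0.000, 0.049, 0.985,
                    0.336, 0.152, 0.774, 0.504],
    },
    index=["isPremium", "Difficulty", "1+", "2+", "3+", "4+", "5+", "6+", "7+"],
)

alpha = 0.05  # conventional significance level (an assumption, not from the source)
significant = coef_table[coef_table["p_value"] < alpha]

print(significant)
```

With a fitted statsmodels result, the same filter can be applied to `model.params` and `model.pvalues` directly.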


### **Summary of Inferences:**

- The dataset contains **no null values** and has a **mean acceptance rate of 56.40%**, with a **lowest acceptance rate of 10.4%**.  
- **`['Database']` is the most frequent tag combination** (284 questions), and **`Medium` is the most common difficulty level** (1733 of 3306 questions).  
- **Most questions are not premium-only** (2630 of 3306), but **premium-only questions have a significantly higher acceptance rate** (about +9.9 percentage points).  
- **The `Difficulty` coefficient is small and hard to interpret**, because the label encoding is alphabetical (`Easy`=0, `Hard`=1, `Medium`=2) rather than ordered by difficulty.  
- **Topics like `Database`, `Graph`, `Dynamic Programming`, `String`, and `Tree` have large, significant positive coefficients**; without an intercept these reflect acceptance levels rather than increases.  
- **The `1+` and `2+` columns have significant negative coefficients**, which may indicate that questions with many tags are harder.  
- **Algorithmic topics (`Backtracking`, `Bit Manipulation`, `Greedy`, `Sorting`) also have positive coefficients**, with the same caveat.  
- **Certain topics (e.g., `Bucket Sort`, `Game Theory`, `Shortest Path`) have an insignificant effect**, so the data give no evidence that they shift acceptance rates.  
- The **regressors are jointly significant (F-statistic: 261.4)**, but the **Adjusted R² of 0.845 is uncentered** and overstates the share of variance explained.  

These insights help understand question difficulty and the factors associated with problem acceptance rates. 🚀
