# Do you want to know why you lose hair?

## 📖 Background
As we age, hair loss becomes one of the health concerns of many people. The fullness of hair not only affects appearance, but is also closely related to an individual's health.

A survey brings together a variety of factors that may contribute to hair loss, including genetic factors, hormonal changes, medical conditions, medications, nutritional deficiencies, psychological stress, and more. Through data exploration and analysis, the potential correlation between these factors and hair loss can be deeply explored, thereby providing useful reference for the development of individual health management, medical intervention and related industries.

## 💪 Competition challenge

Choose your difficulty level! You decide how challenging this competition will be. Every level will have an equal chance to win!

### Level 1: Descriptive statistics:
1. What is the average age? What is the age distribution?
2. Which medical conditions are the most common? How often do they occur?
3. What types of nutritional deficiencies are there and how often do they occur?

### Level 2: Visualization:

1. What is the proportion of patients with hair loss in different age groups?
2. What factors are associated with hair loss? 
3. What does hair loss look like under different stress levels? 

### Level 3: Machine learning:

1. A classification model can be built to predict whether an individual will suffer from hair loss based on given factors.
2. Use cluster analysis to explore whether there are different types of hair loss groups in the data set.
3. Use algorithms such as decision trees or random forests to identify the key factors that best predict hair loss.

## 🧑‍⚖️ Judging criteria

Depending on your skill level, decide on which aspect you want to focus. Each difficulty level (descriptive statistics, visualization, or machine learning) has an equal chance to win.
A well-written descriptive analysis is better than a poorly executed machine learning attempt!

There will only be three winners. The **top 3** will contain the best entry for each difficulty level.

Judging will happen as follows:

| CATEGORY | WEIGHTING | DETAILS                                                              |
|:---------|:----------|:---------------------------------------------------------------------|
| **Recommendations** | 35%       | <ul><li>Clarity of recommendations - how clear and well presented the recommendation is.</li><li>Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid?</li><li>Quality of the executive summary.</li></ul>       |
| **Storytelling**  | 35%       | <ul><li>How well the data and insights are connected to the recommendations.</li><li>How the narrative and whole report connects together.</li><li>Balancing making the report in-depth enough but also concise.</li></ul> |
| **Visualizations** (if applicable) | 20% | <ul><li>Appropriateness of visualization used.</li><li>Clarity of insight from visualization.</li></ul> |
| **Votes** | 10% | <ul><li>Up voting - most upvoted entries get the most points.</li></ul> |

## 🏆 Prizes

| Ranking | Prize |
|:---------|:----------|
| **1st Prize** | **$500**     |
| **2nd Prize**  | **$400**   |
| **3rd Prize** | **$300**  |


## ✅ Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- **Remove redundant cells** like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and explains how you found your insights. 
- Include an **executive summary** of your recommendations at the beginning.
- Check that all the cells run without error

## ⌛️ Time is ticking. Good luck!

## 💾 The data

The survey provides the information you need in the `Predict Hair Fall.csv` in the `data` folder.

#### Data contains information on persons in this survey. Each row represents one person.
- "Id" - A unique identifier for each person.
- "Genetics" - Whether the person has a family history of baldness.
- "Hormonal Changes" - Indicates whether the individual has experienced hormonal changes (Yes/No).
- "Medical Conditions" - Medical history that may lead to baldness; alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.
- "Medications & Treatments" - History of medications that may cause hair loss; chemotherapy, heart medications, antidepressants, steroids, etc.
- "Nutritional Deficiencies" - Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega-3 fatty acid deficiency, etc.
- "Stress" - Indicates the stress level of the individual (Low/Moderate/High).
- "Age" - Represents the age of the individual.
- "Poor Hair Care Habits" - Indicates whether the individual practices poor hair care habits (Yes/No).
- "Environmental Factors" - Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).
- "Smoking" - Indicates whether the individual smokes (Yes/No).
- "Weight Loss" - Indicates whether the individual has experienced significant weight loss (Yes/No).
- "Hair Loss" - Binary variable indicating the presence (1) or absence (0) of baldness in the individual.

In [1]:
import pandas as pd
data = pd.read_csv('data/Predict Hair Fall.csv')
data.head(10)

Unnamed: 0,Id,Genetics,Hormonal Changes,Medical Conditions,Medications & Treatments,Nutritional Deficiencies,Stress,Age,Poor Hair Care Habits,Environmental Factors,Smoking,Weight Loss,Hair Loss
0,133992,Yes,No,No Data,No Data,Magnesium deficiency,Moderate,19,Yes,Yes,No,No,0
1,148393,No,No,Eczema,Antibiotics,Magnesium deficiency,High,43,Yes,Yes,No,No,0
2,155074,No,No,Dermatosis,Antifungal Cream,Protein deficiency,Moderate,26,Yes,Yes,No,Yes,0
3,118261,Yes,Yes,Ringworm,Antibiotics,Biotin Deficiency,Moderate,46,Yes,Yes,No,No,0
4,111915,No,No,Psoriasis,Accutane,Iron deficiency,Moderate,30,No,Yes,Yes,No,1
5,139661,Yes,No,Psoriasis,Antibiotics,Magnesium deficiency,Low,37,No,Yes,No,Yes,1
6,169255,Yes,Yes,No Data,No Data,Selenium deficiency,High,40,Yes,No,No,No,1
7,112032,Yes,No,Dermatosis,Chemotherapy,Omega-3 fatty acids,High,35,Yes,No,Yes,No,0
8,140785,Yes,No,Eczema,Steroids,Selenium deficiency,Moderate,19,No,No,Yes,Yes,1
9,187999,No,Yes,Ringworm,Rogaine,Magnesium deficiency,Moderate,49,Yes,Yes,Yes,No,0


In [2]:
data.columns = data.columns.str.strip()

In [3]:
data.describe()

Unnamed: 0,Id,Age,Hair Loss
count,999.0,999.0,999.0
mean,153354.673674,34.188188,0.497497
std,25516.041985,9.37798,0.500244
min,110003.0,18.0,0.0
25%,131867.5,26.0,0.0
50%,152951.0,34.0,0.0
75%,174969.0,42.0,1.0
max,199949.0,50.0,1.0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Id                        999 non-null    int64 
 1   Genetics                  999 non-null    object
 2   Hormonal Changes          999 non-null    object
 3   Medical Conditions        999 non-null    object
 4   Medications & Treatments  999 non-null    object
 5   Nutritional Deficiencies  999 non-null    object
 6   Stress                    999 non-null    object
 7   Age                       999 non-null    int64 
 8   Poor Hair Care Habits     999 non-null    object
 9   Environmental Factors     999 non-null    object
 10  Smoking                   999 non-null    object
 11  Weight Loss               999 non-null    object
 12  Hair Loss                 999 non-null    int64 
dtypes: int64(3), object(10)
memory usage: 101.6+ KB


In [5]:
data['Hair Loss'].value_counts()

Hair Loss
0    502
1    497
Name: count, dtype: int64

In [6]:
age_18_data = data[data["Age"] == 18]

print(age_18_data)

         Id Genetics Hormonal Changes  ... Smoking Weight Loss Hair Loss
47   167156       No               No  ...     Yes          No         1
48   150641      Yes               No  ...      No         Yes         1
82   123676       No              Yes  ...      No          No         0
113  169244       No              Yes  ...      No          No         0
137  150939      Yes               No  ...      No         Yes         1
139  121433       No               No  ...      No          No         0
175  116141       No               No  ...      No         Yes         1
184  199326      Yes              Yes  ...     Yes         Yes         0
201  161651      Yes              Yes  ...     Yes          No         0
219  155143       No              Yes  ...     Yes          No         0
229  175461      Yes              Yes  ...      No         Yes         1
235  134890       No              Yes  ...     Yes          No         1
239  125449      Yes               No  ...     Yes 

In [7]:
bins = [17, 25, 35, 45, 50]
labels = ["18-25", "26-35", "36-45", "46-50"]

data["Age Group"] = pd.cut(data["Age"], bins=bins, labels=labels, right=False)

data = pd.get_dummies(data, columns=["Age Group"])

data = data.drop("Age", axis=1)

In [8]:
new_data = pd.get_dummies(data, columns=["Nutritional Deficiencies","Medical Conditions","Medications & Treatments","Stress"], prefix=["Deficiency","Condition","Medication","Stress"], drop_first=True)

new_data

Unnamed: 0,Id,Genetics,Hormonal Changes,Poor Hair Care Habits,Environmental Factors,Smoking,Weight Loss,Hair Loss,Age Group_18-25,Age Group_26-35,Age Group_36-45,Age Group_46-50,Deficiency_Iron deficiency,Deficiency_Magnesium deficiency,Deficiency_No Data,Deficiency_Omega-3 fatty acids,Deficiency_Protein deficiency,Deficiency_Selenium deficiency,Deficiency_Vitamin A Deficiency,Deficiency_Vitamin D Deficiency,Deficiency_Vitamin E deficiency,Deficiency_Zinc Deficiency,Condition_Androgenetic Alopecia,Condition_Dermatitis,Condition_Dermatosis,Condition_Eczema,Condition_No Data,Condition_Psoriasis,Condition_Ringworm,Condition_Scalp Infection,Condition_Seborrheic Dermatitis,Condition_Thyroid Problems,Medication_Antibiotics,Medication_Antidepressants,Medication_Antifungal Cream,Medication_Blood Pressure Medication,Medication_Chemotherapy,Medication_Heart Medication,Medication_Immunomodulators,Medication_No Data,Medication_Rogaine,Medication_Steroids,Stress_Low,Stress_Moderate
0,133992,Yes,No,Yes,Yes,No,No,0,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True
1,148393,No,No,Yes,Yes,No,No,0,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
2,155074,No,No,Yes,Yes,No,Yes,0,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True
3,118261,Yes,Yes,Yes,Yes,No,No,0,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True
4,111915,No,No,No,Yes,Yes,No,1,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
994,184367,Yes,No,Yes,Yes,Yes,Yes,1,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,True,False
995,164777,Yes,Yes,No,No,No,Yes,0,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
996,143273,No,Yes,Yes,No,Yes,Yes,1,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True
997,169123,No,Yes,Yes,Yes,Yes,Yes,1,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True


In [9]:
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 44 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   Id                                    999 non-null    int64 
 1   Genetics                              999 non-null    object
 2   Hormonal Changes                      999 non-null    object
 3   Poor Hair Care Habits                 999 non-null    object
 4   Environmental Factors                 999 non-null    object
 5   Smoking                               999 non-null    object
 6   Weight Loss                           999 non-null    object
 7   Hair Loss                             999 non-null    int64 
 8   Age Group_18-25                       999 non-null    bool  
 9   Age Group_26-35                       999 non-null    bool  
 10  Age Group_36-45                       999 non-null    bool  
 11  Age Group_46-50                 

In [10]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

X = new_data.drop(columns=["Hair Loss", "Id"])
y = new_data["Hair Loss"]

X = X.apply(label_encoder.fit_transform)

In [11]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 42 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Genetics                              999 non-null    int64
 1   Hormonal Changes                      999 non-null    int64
 2   Poor Hair Care Habits                 999 non-null    int64
 3   Environmental Factors                 999 non-null    int64
 4   Smoking                               999 non-null    int64
 5   Weight Loss                           999 non-null    int64
 6   Age Group_18-25                       999 non-null    int64
 7   Age Group_26-35                       999 non-null    int64
 8   Age Group_36-45                       999 non-null    int64
 9   Age Group_46-50                       999 non-null    int64
 10  Deficiency_Iron deficiency            999 non-null    int64
 11  Deficiency_Magnesium deficiency       999 non

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Logistic Regression

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

model = LogisticRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.5
Confusion Matrix:
 [[82 62]
 [88 68]]
Classification Report:
               precision    recall  f1-score   support

           0       0.48      0.57      0.52       144
           1       0.52      0.44      0.48       156

    accuracy                           0.50       300
   macro avg       0.50      0.50      0.50       300
weighted avg       0.50      0.50      0.50       300



Decision Tree


In [14]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

def report_model(model):
    model_preds = model.predict(X_test)
    print(classification_report(y_test,model_preds))
    print('\n')
    plt.figure(figsize=(12,8),dpi=150)
    plot_tree(model,filled=True,feature_names=X_test.columns);

In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(random_state=42)

dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)

print(confusion_matrix(y_test,y_pred))

print(classification_report(y_test,y_pred))

# print(y_pred)

acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))

[[82 62]
 [85 71]]
              precision    recall  f1-score   support

           0       0.49      0.57      0.53       144
           1       0.53      0.46      0.49       156

    accuracy                           0.51       300
   macro avg       0.51      0.51      0.51       300
weighted avg       0.51      0.51      0.51       300

Test set accuracy: 0.51


In [16]:
dt.feature_importances_

array([0.03870821, 0.04117292, 0.05382967, 0.06713262, 0.0107299 ,
       0.04393839, 0.0319557 , 0.0127504 , 0.03062418, 0.022191  ,
       0.01934692, 0.03067518, 0.00974881, 0.01889409, 0.00710146,
       0.01810458, 0.01881114, 0.02675874, 0.01831261, 0.01230224,
       0.02007876, 0.0339484 , 0.01371331, 0.01290643, 0.01651433,
       0.03845028, 0.00524869, 0.02346251, 0.02287858, 0.00410444,
       0.02244021, 0.01673115, 0.0287202 , 0.01658999, 0.02602041,
       0.01560015, 0.03260869, 0.        , 0.0174287 , 0.02691227,
       0.04733277, 0.02522097])

In [17]:
# report_model(dt)

Entropy as criterion for Decision Tree

In [18]:
param_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'],   # Measure of quality of split
    'splitter': ['best', 'random'],                 # Split strategy at each node
    'max_depth': [None, 1, 2, 3, 4, 5, 10, 15, 20], # Max depth of the tree
    'min_samples_split': [2, 5, 10, 0.1, 0.2],     # Minimum samples to split a node (can be int or float)
    'min_samples_leaf': [1, 2, 5, 10, 0.1, 0.2],   # Minimum samples at leaf node (int or fraction)
    'min_weight_fraction_leaf': [0.0, 0.1, 0.2],   # Minimum weighted fraction of sum total weights at a leaf node
    'max_features': [None, 'sqrt', 'log2', 0.5, 0.7], # Features to consider when finding best split (can be int, float, or string)             
    'max_leaf_nodes': [None, 5, 10, 20],  # Limit max number of leaf nodes
    'min_impurity_decrease': [0.0, 0.1, 0.2],      # Minimum impurity decrease to split a node
    'class_weight': [None, 'balanced'],            # Weights associated with classes
    'ccp_alpha': [0.0, 0.01, 0.1, 0.2],            # Complexity parameter for minimal cost-complexity pruning
    'monotonic_cst': [None]  # Monotonicity constraint (None, as it requires specific configuration for each feature)
}


In [45]:
from sklearn.model_selection import GridSearchCV
import warnings

grid_search = GridSearchCV(estimator=dt,param_grid=param_grid,cv=5, scoring='accuracy')

warnings.filterwarnings('ignore')

grid_search.fit(X_train,y_train)

Error: Failed to execute this cell, please try again.

In [50]:
grid_search.best_score_

0.5622302158273381

In [52]:
grid_search.best_params_

{'criterion': 'entropy',
 'max_depth': 15,
 'max_features': 0.5,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'splitter': 'best'}

In [48]:
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.metrics import accuracy_score

# dt = DecisionTreeClassifier(random_state=42, criterion='entropy', max_depth = 3)

# dt.fit(X_train, y_train)

# y_pred = dt.predict(X_test)

# print(confusion_matrix(y_test,y_pred))

# print(classification_report(y_test,y_pred))

# # print(y_pred)

# acc = accuracy_score(y_test, y_pred)
# print("Test set accuracy: {:.2f}".format(acc))

# report_model(dt)

In [49]:
# from sklearn.ensemble import RandomForestClassifier
# model = RandomForestClassifier(random_state=42)
# model.fit(X_train, y_train)

# from sklearn.metrics import accuracy_score, classification_report
# y_pred = model.predict(X_test)
# print("Accuracy:", accuracy_score(y_test, y_pred))
# print(classification_report(y_test, y_pred))
