# Rare disease and cancer

Huntington's disease (HD) is a rare disease.

"Huntington’s disease is a genetic, progressive, neurodegenerative disorder characterized by the gradual development of involuntary muscle movements. [...] Dementia is typically associated with progressive disorientation and confusion, personality disintegration, etc. [...].

Symptoms commonly develop between ages 30 and 50. The disease progresses slowly and a person may live for another 15-20 years after the onset of symptoms.

[...]

Huntington’s disease is inherited as an autosomal dominant trait. Human traits, including the classic genetic diseases, are the product of the interaction of two genes, one received from the father and one from the mother.

In dominant disorders, a single copy of the disease gene (received from either the mother or father) will be expressed “dominating” the other normal gene and resulting in the appearance of the disease. The risk of transmitting the disorder from affected parent to offspring is 50 percent for each pregnancy regardless of the sex of the resulting child.
"
[Source https://rarediseases.org/rare-diseases/huntingtons-disease/]
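
The 50 percent figure in the quote can be checked by enumerating the equally likely allele combinations. The sketch below assumes one affected heterozygous parent and one unaffected parent; the allele labels `H` (disease) and `h` (normal) and the function name `offspring_risk` are illustrative choices, not part of the quoted source.

```python
from fractions import Fraction
from itertools import product


def offspring_risk(parent_a: str, parent_b: str, disease_allele: str = "H") -> Fraction:
    """Fraction of equally likely offspring genotypes with at least one disease allele."""
    # Each child receives one allele from each parent.
    outcomes = list(product(parent_a, parent_b))
    affected = [pair for pair in outcomes if disease_allele in pair]
    return Fraction(len(affected), len(outcomes))


# Affected heterozygous parent ("Hh") and unaffected parent ("hh")
print(offspring_risk("Hh", "hh"))  # 1/2
```

Because the trait is dominant, a single `H` is enough to count an outcome as affected, which is why half of the four combinations qualify.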


Source: https://hopes.stanford.edu/population-genetics-and-hd/

| Population | Frequency of HD |
| --- | --- |
| South Africa (blacks) | 0.6 |
| Japan | 1-4 |
| Hong Kong | 3.7 |
| Finland | 6.0 |
| Europe & countries of European descent | 40-100 |
| - Northern Ireland | 64 |
| - South Wales | 76.1 |
| - Scotland (Grampian Region) | 99.4 |
| - United States | 100 |

(cases per million people)
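
To make the unit easier to read, here is a small sketch that converts a frequency in cases per million into "1 in N people" and into an expected number of cases. The function names and the population size of 2,000,000 are hypothetical and used only for illustration.

```python
def one_in_n(cases_per_million: float) -> float:
    """Convert cases per million people into the N of '1 in N people'."""
    return 1_000_000 / cases_per_million


def expected_cases(cases_per_million: float, population: int) -> float:
    """Expected number of cases in a population of the given size."""
    return cases_per_million * population / 1_000_000


print(one_in_n(100))                 # 10000.0 (United States row)
print(expected_cases(64, 2_000_000)) # 128.0 (Northern Ireland rate, hypothetical population)
```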

Several studies show that cancer prevalence among HD patients is much lower than in the general population.

Scientists are trying to find a possible pain treatment for HD patients. Data from HD patients in the UK was collected to build a model that predicts whether a specific drug, "A", will help with pain.

In [200]:
import random
import numpy as np
import pandas as pd
from scipy.stats import poisson  # needed for poisson.pmf below
from sklearn.svm import SVC
from itertools import product

In [182]:
# 1 = responds well, 0 = poor response; this is the label we want to predict.
response = [1]*int(2500/2) + [0]*int(2500/2)

df = pd.DataFrame()

#age
x = np.arange(21,65)
pmf = poisson.pmf(x, 50)
df['age'] = [random.choices(x, weights = pmf, k = 1)[0] for n in range(2499)]

df['gender'] = np.random.randint(0,2,size=2499).tolist() #0 male, 1 female
df['marital status'] = [[1, 0][random.random()>0.8] for n in range(2499)] #0 single, 1 married

#0 British, 1 South Africa, 2 Egypt
#the prevalence in South Africa is very low; in Egypt it is very high, but there is very little research
country = ['British', 'South Africa', 'Egypt']
participants = [0.47, 0.2 , 0.33]
#df['nationality'] = []

races = ['white european', 'asian', 'black', 'mixed', 'other']
prevalence = [[.9, .01, .01, .05, .03], [.8, .01, .01, .15, .03], [.33, .01, .01, .15, .5]]

race = []
nationality = []
for i in range(len(races)):
    for c in range(len(country)):
        # number of participants of race i from country c
        count = round(prevalence[c][i]*int(participants[c]*2500))
        race += [i]*count
        nationality += [c]*count
print(len(race), len(nationality))
df['race'] = race[:-1]
df['nationality'] = nationality[:-1]
#comorbidities
df['depression'] = [[1, 0][random.random()>0.7] for n in range(2499)]
df['cancer'] = [[1, 0][random.random()>0.06] for n in range(2499)]
df['recurrent infections'] = [[1, 0][random.random()>0.7] for n in range(2499)]

#Add our subject
#44 years old, female, married, black, depression, cancer, no recurring infections,
# let's assume for now this individual is British
# the response is 0 (no good response to drug)
df.loc[len(df.index)] = [44, 1, 1, 2, 0, 1, 1, 0]

2500 2500
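
The `[[1, 0][random.random() > p] ...]` idiom used for the binary columns above picks 1 with probability `p` and 0 otherwise, but it is unseeded, so the cohort changes on every run. A minimal sketch of a seeded equivalent follows; the helper name `bernoulli_column` and the seed value are assumptions made for illustration.

```python
import random


def bernoulli_column(p_one: float, n: int, rng: random.Random) -> list:
    """Return n draws that are 1 with probability p_one, else 0."""
    # Same logic as [1, 0][rng.random() > p_one]: the comparison is False
    # (index 0, value 1) with probability p_one and True (index 1, value 0) otherwise.
    return [1 if rng.random() <= p_one else 0 for _ in range(n)]


rng = random.Random(12)  # fixed seed so the synthetic cohort is reproducible
married = bernoulli_column(0.8, 2499, rng)
print(sum(married) / len(married))  # close to 0.8
```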


In [183]:
df

Unnamed: 0,age,gender,marital status,race,nationality,depression,cancer,recurrent infections
0,49,0,1,0,0,0,0,1
1,48,1,1,0,0,0,0,0
2,47,0,1,0,0,1,0,1
3,44,1,1,0,0,1,0,0
4,57,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...
2495,55,0,1,4,2,1,0,1
2496,44,0,1,4,2,1,0,1
2497,50,1,0,4,2,1,0,0
2498,60,1,1,4,2,1,0,1


In [184]:
# Check whether any other patient record coincides with that of your MP
df[(df.age==44) & (df.gender==1) & (df['marital status']==1) & (df.nationality==0) &
   (df.race==2) & (df.depression==1) & (df['recurrent infections']==0)]

Unnamed: 0,age,gender,marital status,race,nationality,depression,cancer,recurrent infections
2499,44,1,1,2,0,1,1,0
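
The filter above returns a single row, meaning the target's combination of known attributes is unique in the data. The same check can be generalised by counting how many records share each combination of known attributes. This is a sketch on hypothetical toy records, not the study data, and `group_sizes` is an illustrative helper name.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[k] for k in quasi_identifiers) for r in records)


# Hypothetical toy records, not taken from the study data
records = [
    {"age": 44, "gender": 1, "race": 2, "cancer": 1},
    {"age": 49, "gender": 0, "race": 0, "cancer": 0},
    {"age": 49, "gender": 0, "race": 0, "cancer": 1},
]

sizes = group_sizes(records, ["age", "gender", "race"])
unique = [combo for combo, count in sizes.items() if count == 1]
print(unique)  # [(44, 1, 2)]
```

Any combination with a count of 1 identifies exactly one person, so their remaining attributes (here `cancer`) are exposed to anyone who knows the other values.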


A well-known medical insurer wants to determine how much it should charge its clients.

Typically, for an insurance policy to be valid, any relevant medical history must be disclosed. However, you do not need to disclose that you are a carrier of a genetic disease if you show no symptoms.

In [185]:
prng = np.random.RandomState(12)
svc = SVC(C=1, gamma=3, probability=True, random_state=prng)
svc.fit(df, response)

SVC(C=1, gamma=3, probability=True,
    random_state=RandomState(MT19937) at 0x7F5CF67ED040)

Predict drug response

In [197]:
# Record that IS in the training data
test_example = pd.DataFrame(
    {
        'age': 60,
        'gender': 1,
        'marital status': 1,
        'race': 4,
        'nationality': 2,
        'depression': 1,
        'cancer': 0,
        'recurrent infections':1
    }, index=[1]
)

predictions = svc.predict_proba(test_example)
print(f'bad treatment response score = {predictions[0][0]:.2f}')
print(f'good treatment response score = {predictions[0][1]:.2f}')

bad treatment response score = 0.95
good treatment response score = 0.05


In [205]:
# Record that is NOT in the training data
test_example = pd.DataFrame(
    {
        'age': 22,
        'gender': 0,
        'marital status': 1,
        'race': 2,
        'nationality': 2,
        'depression': 1,
        'cancer': 0,
        'recurrent infections':1
    }, index=[1]
)

predictions = svc.predict_proba(test_example)
print(f'bad treatment response score = {predictions[0][0]:.2f}')
print(f'good treatment response score = {predictions[0][1]:.2f}')

bad treatment response score = 0.71
good treatment response score = 0.29


So, in these examples, a record that was in the study gets a confidence score of about 0.05, while records that were not in the study score roughly 0.2-0.3.
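
The gap between those scores suggests a simple membership test: flag a record as likely part of the study when the model is unusually confident about it. The sketch below is an assumption-laden illustration. The name `likely_in_study` and the `margin` of 0.1 are hypothetical, chosen only to separate the 0.05 score from the 0.2-0.3 range seen above, and treating scores near 1 the same way assumes the model is equally overconfident on study records labelled as good responders.

```python
def likely_in_study(good_response_score: float, margin: float = 0.1) -> bool:
    """Flag a score as 'likely a training record' when it lies within margin of 0 or 1."""
    return good_response_score <= margin or good_response_score >= 1 - margin


print(likely_in_study(0.05))  # True
print(likely_in_study(0.29))  # False
```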

# Attack

In [218]:
feature_vals = {
        'age': [44],
        'gender': [1,0],
        'marital status': [1,0],
        'race': [2,4],
        'nationality': [0],
        'depression': [1],
        'cancer': [1,0],
        'recurrent infections':[0,1]
}

all_combinations = product(*feature_vals.values())
print(all_combinations)
g = {}
for i, combination in enumerate(all_combinations):
    # Turn this particular combination into a dictionary keyed by feature name
    g[i] = dict(zip(feature_vals.keys(), combination))
attack_inputs = pd.DataFrame(g).T

probs = svc.predict_proba(attack_inputs)

# Add the probability of a good treatment response (class 1) to the dataframe
attack_values = attack_inputs.copy()
attack_values['confidence'] = probs[:, 1]

<itertools.product object at 0x7f5cf639e280>


In [219]:
# Our fictional patient is at the top (row 0), with the lowest confidence
attack_values

Unnamed: 0,age,gender,marital status,race,nationality,depression,cancer,recurrent infections,confidence
0,44,1,1,2,0,1,1,0,0.051478
1,44,1,1,2,0,1,1,1,0.262905
2,44,1,1,2,0,1,0,0,0.262118
3,44,1,1,2,0,1,0,1,0.242035
4,44,1,1,4,0,1,1,0,0.282189
5,44,1,1,4,0,1,1,1,0.283887
6,44,1,1,4,0,1,0,0,0.226822
7,44,1,1,4,0,1,0,1,0.259369
8,44,1,0,2,0,1,1,0,0.265048
9,44,1,0,2,0,1,1,1,0.283151
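
Rather than reading the table by eye, an attacker could rank the candidates and pick the one whose confidence stands furthest apart from the rest. The sketch below uses the first five rows of the output above; the tuple layout (gender, marital status, race, cancer, recurrent infections) and the variable names are illustrative assumptions.

```python
# (gender, marital status, race, cancer, recurrent infections) -> confidence,
# copied from the first five rows of the attack output
candidates = {
    (1, 1, 2, 1, 0): 0.051478,
    (1, 1, 2, 1, 1): 0.262905,
    (1, 1, 2, 0, 0): 0.262118,
    (1, 1, 2, 0, 1): 0.242035,
    (1, 1, 4, 1, 0): 0.282189,
}

# Sort from most to least extreme (lowest confidence first)
ranked = sorted(candidates.items(), key=lambda item: item[1])
best_guess, score = ranked[0]
gap = ranked[1][1] - score  # distance to the next candidate

print(best_guess, score, round(gap, 3))  # (1, 1, 2, 1, 0) 0.051478 0.191
```

The top-ranked combination has `cancer = 1`, so the attacker learns an attribute they did not know beforehand.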
