# 23andMe SNP logistic regression
I trained a logistic regression model on data from the [Human Genome Diversity Project](http://www.hagsc.org/hgdp/). The goal of this project was to find out whether my DNA is a closer match to northern or southern Han Chinese DNA.

Future improvements: Because I ran the algorithm on my own computer, without any big-data or distributed-computing framework, I had to take a random sample of SNPs from my DNA and the genome database. Once I am more comfortable with such a framework, I'd like to rerun the algorithm on the full set of SNPs and see whether the results change significantly.
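
Short of a distributed framework, one intermediate option might be to stream the large report in chunks and keep only the SNPs of interest. The sketch below assumes a tab-delimited file with SNPs as rows and samples as columns; the in-memory `report`, the sample names and the `wanted` set are hypothetical stand-ins, not the real HGDP data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large SNP-by-sample report (tab-delimited, SNPs as rows)
report = io.StringIO(
    "SNP\tS1\tS2\n"
    "rs1\tAA\tAG\n"
    "rs2\tCC\tCT\n"
    "rs3\tGG\tGG\n"
    "rs4\tTT\tCT\n"
)
wanted = {"rs2", "rs4"}  # assumed: SNPs that also appear in the personal genotype file

kept = []
# chunksize makes read_csv return an iterator of DataFrames instead of one big frame
for chunk in pd.read_csv(report, sep="\t", index_col=0, chunksize=2):
    kept.append(chunk[chunk.index.isin(wanted)])

# Transpose so that samples are rows and SNPs are columns
subset = pd.concat(kept).transpose()
print(subset)
```

Only one chunk is held in memory at a time, so the peak footprint depends on `chunksize` and on how many SNPs are kept rather than on the size of the whole file.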

## Prepping and cleaning data:

In [30]:
import pandas as pd

# My 23andMe export: one row (me), one column per SNP rsid
Eric = pd.read_csv("Eric_23andMe.txt", delimiter="\t", usecols=["rsid", "genotype"]).set_index("rsid").transpose()

# HGDP sample codes and the population group each code belongs to
sample = pd.read_csv("HGDP_SampleList.txt", header=None, names=["code"])
key = pd.read_csv("key.csv", header=None, delimiter=" ", usecols=[1, 2], names=["code", "group"])
sample = sample.merge(key, how="left", on="code")

# Keep only the two Han groups
han = ["Han", "Han.NChina"]
hans = sample[sample["group"].isin(han)]
han_codes = ["SNP"] + list(hans.code.unique())

# Load genotypes for those samples only, then transpose to one row per sample
final = pd.read_table("HGDP_FinalReport_Forward.txt", delimiter="\t", usecols=han_codes, index_col=0,
                      dtype="category").transpose()
final = pd.merge(key, final, how="right", right_index=True, left_on="code")
final = final.set_index("code")
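
To make the shape of the joins easier to follow, here is a minimal sketch of the same filter-and-merge pattern on toy data. The sample codes, the `French` label and the genotype values are all hypothetical; only the pandas calls mirror the cell above.

```python
import pandas as pd

# Hypothetical sample codes and population labels, standing in for the two lookup files
sample = pd.DataFrame({"code": ["HGDP001", "HGDP002", "HGDP003"]})
key = pd.DataFrame({
    "code": ["HGDP001", "HGDP002", "HGDP003"],
    "group": ["Han", "French", "Han.NChina"],
})

# Attach a population label to each sample, then keep only the two Han groups
labeled = sample.merge(key, how="left", on="code")
hans = labeled[labeled["group"].isin(["Han", "Han.NChina"])]

# Genotypes with one row per sample, indexed by sample code (as after the transpose)
genotypes = pd.DataFrame(
    {"rs1": ["AA", "AG"], "rs2": ["CC", "CT"]},
    index=["HGDP001", "HGDP003"],
)

# Match the "code" column of the labels against the index of the genotype frame
toy_final = pd.merge(key, genotypes, how="right", left_on="code", right_index=True)
toy_final = toy_final.set_index("code")
print(toy_final)
```

Using `how="right"` keeps exactly the samples that have genotype rows, so the unrelated `HGDP002` label is dropped.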

## Sampling set:

In [101]:
import numpy as np

# SNPs present in both the HGDP panel and my 23andMe file
SNPs = list(set(final.columns) & set(Eric.columns))
# replace=False so the same SNP is never drawn twice
toy_SNPs = ["group"] + list(np.random.choice(SNPs, 1000, replace=False))
# Eric has no "group" column; reindex fills it with NaN instead of raising a KeyError
Eric_toy = Eric.reindex(columns=toy_SNPs)
final_toy = final.loc[:, toy_SNPs]
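
Two details of this sampling step are easy to get wrong: `np.random.choice` draws with replacement unless told otherwise, and selecting a label that does not exist (here `group`, which is not a column of the 23andMe frame) with `.loc` raises a `KeyError` in current pandas. A small sketch of one way to handle both, using a hypothetical one-row frame and a seeded generator:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sample is reproducible

# Hypothetical one-row genotype frame, shaped like the transposed 23andMe file
person = pd.DataFrame([["AA", "CT", "GG", "AG"]], columns=["rs1", "rs2", "rs3", "rs4"])
shared = sorted(person.columns)  # sorted so the seeded draw does not depend on set ordering

# replace=False guarantees each SNP is drawn at most once
picked = rng.choice(shared, size=3, replace=False).tolist()

# "group" is not a column of `person`; reindex inserts it as NaN instead of raising
toy = person.reindex(columns=["group"] + picked)
print(toy)
```

The NaN `group` column is harmless here because it is dropped again before encoding.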


## Evaluating Model:

In [102]:
from sklearn import linear_model, model_selection

# One-hot encode the genotypes; group_Han is the binary target (1 = "Han", 0 = "Han.NChina")
encoded_toy = pd.get_dummies(final_toy).drop("group_Han.NChina", axis=1)
y = encoded_toy["group_Han"]
x = encoded_toy.drop("group_Han", axis=1)

Eric_encoded = pd.get_dummies(Eric_toy.drop("group", axis=1))

# Add all-zero columns for genotypes that appear in only one of the two frames
for col in set(x.columns) - set(Eric_encoded.columns):
    Eric_encoded[col] = 0
for col in set(Eric_encoded.columns) - set(x.columns):
    x[col] = 0
# Use the same column order in both frames so features line up at prediction time
Eric_encoded = Eric_encoded[x.columns]
print(Eric_encoded.shape, x.shape)

alg = linear_model.LogisticRegressionCV()
# cross_validate fits clones of alg on each fold, so no separate fit is needed here
cv_results = model_selection.cross_validate(alg, x, y, return_train_score=True)
cv_results

(1, 2899) (44, 2899)


{'fit_time': array([ 0.14259291,  0.16774511,  0.1822772 ]),
 'score_time': array([ 0.00047278,  0.00046277,  0.0004859 ]),
 'test_score': array([ 0.75      ,  0.78571429,  0.78571429]),
 'train_score': array([ 0.78571429,  0.76666667,  0.76666667])}
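
With so few samples, accuracy is easier to interpret next to a majority-class baseline. The sketch below shows one way to set up that comparison with scikit-learn's `DummyClassifier`. The data is synthetic (random features and an arbitrary 33/11 label split chosen only for illustration), so the numbers it prints say nothing about the HGDP samples.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 44 samples, imbalanced labels, features unrelated to the label
X = rng.integers(0, 2, size=(44, 50))
y = np.array([1] * 33 + [0] * 11)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the larger class
model = LogisticRegression(max_iter=1000)

# For classifiers, an integer cv uses stratified folds
base_acc = cross_val_score(baseline, X, y, cv=3).mean()
model_acc = cross_val_score(model, X, y, cv=3).mean()
print(f"baseline={base_acc:.3f} model={model_acc:.3f}")
```

Running the same comparison on `x` and `y` from the cell above would show how much the model adds beyond always predicting the larger group.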

## Running model on my DNA

In [108]:
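print(x.shape, Eric_encoded.shape)
# Refit on all 44 samples, then score my genotypes
alg.fit(x, y)
# Index by x.columns so my features are in the same order the model was trained on
alg.predict_proba(Eric_encoded[x.columns])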

(44, 2899) (1, 2899)


array([[ 0.22741986,  0.77258014]])
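
The columns of a `predict_proba` result follow the order of the fitted estimator's `classes_` attribute. A minimal sketch on a synthetic one-feature problem (not the SNP data) shows how to pair each probability with its class label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic problem: the label is 1 when the single feature is large
X = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(np.array([[0.9]]))
# Pair each probability with the class label it belongs to
by_class = dict(zip(clf.classes_.tolist(), proba[0].tolist()))
print(clf.classes_, by_class)
```

In the notebook the target is `group_Han`, so, assuming it is encoded as 0/1, the second column of the output above would be the probability of the `Han` group.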

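## Conclusion:

Based on the model I trained, my DNA is a closer match to the southern Han Chinese dataset: the model assigns a probability of about 0.77 to the `Han` group, which I treat as southern, and about 0.23 to `Han.NChina`. This result seems reasonable, since two of my grandparents are southern Chinese, one is northern Chinese, and one is from the border between the two regions.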