## Assignment 1: Make a Cultist Model
1. Import the data from the dataset and store it in a pandas variable ***cult***.
2. Divide the data into two datasets, ***cultists*** and ***roles***: the X dataset holds the stats (*features*) of the cultists, and the Y dataset holds their positions (*targets*).
3. Refactor the dataset ***cultists*** by removing the irrelevant features, so that it only contains the characteristic statistics:
```python
['Stealth', 'Influence', 'Endurance', 'Lore', 'Economic', 'Strength', 'Insanity']
```
4. Refactor the dataset ***roles*** so that each value becomes a machine-friendly number instead of a string.
5. Use LinearRegression to make a model object.
6. Use train_test_split from model_selection to make training and testing datasets from ***cultists*** and ***roles***.
   1. Try adjusting the test_size variable later to see what gives the better scores.
7. Feed the model with the training datasets, and score the model on the training dataset (*model.score*).
8. Score the model on the testing datasets, and compare to the training scores.

In [102]:
import pandas as pd
import numpy as np
import sklearn.linear_model
from sklearn.model_selection import train_test_split

#1.1
data = pd.read_csv('https://raw.githubusercontent.com/Micniks/Python-Week11-Group-3-Assignments/main/cultists.csv')

#1.2, 1.3
xs = data[['Stealth', 'Influence', 'Endurance', 'Lore', 'Economic', 'Strength', 'Insanity']]
ys = data[['Position']].copy()  # .copy() avoids a SettingWithCopyWarning in step 1.4


label_mapping = {
    'Priest':0,
    'Enforcer': 1,
    'Assassin': 2,
    'Recruiter': 3,
    'Accountant': 4,
    'Advisor': 5
}
#1.4
ys['Position'] = ys['Position'].map(label_mapping)

#ys_reshape = np.array(ys['Position']).reshape(-1,1)

#1.5
model = sklearn.linear_model.LinearRegression()
#1.6
x_train, x_test, y_train, y_test = train_test_split(xs, ys, test_size=0.50)
#1.7
model.fit(x_train, y_train)

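#1.7 (score), 1.8
# For a regressor, model.score returns R^2, not accuracy
display(model.score(x_train, y_train))  # training score
display(model.score(x_test, y_test))    # testing score

# Continuous predictions for the training rows
targets = model.predict(x_train)
display(targets)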





0.48836495928861123

0.510033833738618

array([[0.62673441],
       [3.30464984],
       [2.92289943],
       ...,
       [2.11680296],
       [0.29082271],
       [3.82843528]])
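
Step 1.4 relies on `Series.map`, which silently produces `NaN` for any label missing from the mapping. The sketch below shows one way to check for that before training. It uses a toy frame with made-up values (the `Scribe` label is hypothetical and not taken from the dataset), so treat it as an illustration rather than part of the solution.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
roles = pd.DataFrame({"Position": ["Priest", "Enforcer", "Assassin", "Scribe"]})

# A deliberately incomplete mapping: "Scribe" has no code.
label_mapping = {"Priest": 0, "Enforcer": 1, "Assassin": 2}

# Series.map returns NaN for any value that has no key in the mapping.
encoded = roles["Position"].map(label_mapping)

# Collect the labels the mapping did not cover, so they can be fixed before training.
unmapped = roles.loc[encoded.isna(), "Position"].unique().tolist()
print(unmapped)
```

An empty list here means every label received a number, which is worth confirming because most scikit-learn estimators reject `NaN` targets.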

## Assignment 2: Classification of Cultists
*Since the cultist data doesn't seem to have any linear progression, another model might give better results in sorting members...*

1. Use DecisionTreeClassifier to make a new model object.
2. Use train_test_split from model_selection to make training and testing datasets from ***cultists*** and ***roles***.
   1. Try adjusting the test_size variable later to see what gives the better scores.
3. Feed the model with the training datasets, and score the model on the training dataset (*accuracy_score from sklearn.metrics*).
4. Score the model on the testing datasets, and compare to the training scores.
5. Compare the scores from using Classification and using Regression models.

*We suspect the score difference is the result of the dataset's structure not progressing in a linear fashion.*

In [103]:
from sklearn.tree import DecisionTreeClassifier
#2.1
model = DecisionTreeClassifier()
#2.2 
x_train, x_test, y_train, y_test = train_test_split(xs, ys, test_size=0.30)
#2.3
model.fit(x_train, y_train)

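#2.3 (score), 2.4
# For a classifier, model.score returns mean accuracy, the same value accuracy_score would give
display(model.score(x_train, y_train))
display(model.score(x_test, y_test))

#2.5
# The positions are unordered categories, so a classifier fits the task better than
# a linear regression on arbitrary integer codes.
# Note that the regression score is R^2 while this score is accuracy, and that a training
# score of 1.0 against a lower testing score points to overfitting.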



1.0

0.7255
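
The scores printed for the two assignments are not on the same scale: `model.score` returns R² for a regressor and mean accuracy for a classifier. One way to compare like with like is to round the regression predictions to the nearest class code and compute accuracy for both models. The sketch below does this on synthetic data from `make_classification` (an assumption, standing in for the cultist dataset), and the rounding step is an illustrative choice rather than the notebook's method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cultist data: 7 features, 6 classes coded 0..5.
X, y = make_classification(
    n_samples=600, n_features=7, n_informative=5, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Regression on the integer codes, then round to the nearest valid code.
reg = LinearRegression().fit(X_train, y_train)
reg_labels = np.clip(np.rint(reg.predict(X_test)), 0, 5).astype(int)
reg_acc = accuracy_score(y_test, reg_labels)

# Decision tree, scored with the same metric on both splits.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_train_acc = accuracy_score(y_train, tree.predict(X_train))
tree_test_acc = accuracy_score(y_test, tree.predict(X_test))

print("regression accuracy (test):", reg_acc)
print("tree accuracy (train):", tree_train_acc)
print("tree accuracy (test):", tree_test_acc)
```

A large gap between the tree's training and testing accuracy suggests overfitting. Limiting `max_depth` on `DecisionTreeClassifier` is one common way to narrow it.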

## Assignment 3: Find Clusters of Cultists
1. Use the original dataset again, and remove the characteristic statistics used earlier, as well as the name and position of each member.
2. Remove all rows with missing data for any feature.
3. Change the Living_Area feature into multiple numeric features, using One Hot Encoding.
4. Make a model/analyzer from sklearn.cluster, with the appropriate bandwidth for the data.
5. Use the analyzer to process the data, grouping people into clusters.
6. Make a dataset from the cluster array, showing the average statistic values for the features in each cluster.
7. Add a count feature to each cluster, showing how many members are in each cluster.
8. Look at the final data, and answer the following questions:
   1. How many clusters are there?
   2. Which cluster has the highest number of recruits?
   3. What seems to be the defining feature(s) of each cluster?
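
Step 3 asks for one-hot encoding, which turns each category into its own 0/1 column instead of a single integer code. A minimal sketch with `pd.get_dummies` on a toy frame follows. The category values `City` and `Rural` and the `area` prefix are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Toy data; the category values are invented for illustration only.
people = pd.DataFrame({
    "Age": [31, 45, 52],
    "Type_of_living_area": ["City", "Rural", "City"],
})

# Replace the text column with one 0/1 column per category.
encoded = pd.get_dummies(
    people, columns=["Type_of_living_area"], prefix="area", dtype=int
)
print(encoded)
```

Unlike a label encoder, this does not imply an order between the categories, which matters for distance-based methods such as clustering.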

In [116]:
from sklearn import preprocessing
from sklearn.cluster import estimate_bandwidth
from sklearn.cluster import MeanShift

#3.1
c_data = data.drop(['Cultist','Stealth', 'Influence', 'Endurance', 'Lore', 'Economic', 'Strength', 'Insanity', 'Position'], axis=1)

#display(c_data)

#3.2
c_data.dropna(inplace=True)

#3.3
# Note: this uses label encoding (one integer column), not the one-hot encoding the task asks for
label_enc = preprocessing.LabelEncoder()
c_data['Type_of_living_area'] = label_enc.fit_transform(c_data['Type_of_living_area'].astype(str))
#display(c_data)

#3.4
#estimate_bandwidth(c_data)
analyzer = MeanShift(bandwidth=300) 
#3.5
analyzer.fit(c_data)

#3.6
labels = analyzer.labels_
c_data['cluster_group'] = np.nan
data_length=len(c_data)
for i in range(data_length): # loop over every row left after dropna
    c_data.iloc[i,c_data.columns.get_loc('cluster_group')] = labels[i] #set the cluster label on each row


# Group cultists by cluster and average each feature
c_cluster_data = c_data.groupby(['cluster_group']).mean()
# Count of cultists in each cluster
c_cluster_data['Counts'] = pd.Series(c_data.groupby(['cluster_group']).size())
# Keep only clusters with at least 20 members
c_cluster_data = c_cluster_data[c_cluster_data['Counts'] >= 20]
display(c_cluster_data)





| cluster_group | Age | Membership_in_years | Contribution_in_dollars | Members_recruited | Type_of_living_area | Counts |
|---|---|---|---|---|---|---|
| 0.0 | 37.927329 | 7.931056 | 409.625466 | 7.703727 | 0.537267 | 1610 |
| 1.0 | 43.368941 | 15.510772 | 1291.470377 | 16.771993 | 0.570916 | 1114 |
| 2.0 | 46.542805 | 18.577413 | 2047.735883 | 18.26776 | 0.555556 | 549 |
| 3.0 | 48.342975 | 20.81405 | 2614.900826 | 18.394628 | 0.636364 | 484 |
| 4.0 | 50.532688 | 23.348668 | 3194.929782 | 25.847458 | 0.607748 | 413 |
| 5.0 | 49.794872 | 24.166667 | 3763.641026 | 27.557692 | 0.512821 | 312 |
| 6.0 | 52.62215 | 25.807818 | 4305.517915 | 28.465798 | 0.570033 | 307 |
| 7.0 | 51.856522 | 27.295652 | 4835.486957 | 18.191304 | 0.526087 | 230 |
| 8.0 | 53.349673 | 29.398693 | 5446.395425 | 31.323529 | 0.584967 | 306 |
| 9.0 | 56.090323 | 32.341935 | 6822.909677 | 35.754839 | 0.625806 | 155 |
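
In the output above the cluster means rise steadily with Contribution_in_dollars, which suggests that the feature with the largest numeric range may be dominating the unscaled distances. The sketch below shows one possible alternative: standardize the features, let `estimate_bandwidth` pick the bandwidth, and build the per-cluster summary with a direct label assignment instead of a row loop. It runs on synthetic data (an assumption, not the cultist dataset), and the `quantile` value is an arbitrary starting point.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two age groups, with contributions on a much larger scale.
frame = pd.DataFrame({
    "Age": np.concatenate([rng.normal(25, 2, 100), rng.normal(60, 2, 100)]),
    "Contribution_in_dollars": rng.normal(2000, 500, 200),
})

# Put every feature on a comparable scale before measuring distances.
scaled = StandardScaler().fit_transform(frame)

# Estimate a bandwidth from the scaled data instead of hard-coding one.
bandwidth = estimate_bandwidth(scaled, quantile=0.3, random_state=0)
analyzer = MeanShift(bandwidth=bandwidth).fit(scaled)

# labels_ has one entry per row, so it can be assigned as a column directly.
frame["cluster_group"] = analyzer.labels_

# Per-cluster feature means (in original units) plus member counts.
summary = frame.groupby("cluster_group").mean()
summary["Counts"] = frame.groupby("cluster_group").size()
print(summary)
```

Because the labels are attached to the unscaled frame, the summary still reports averages in the original units.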
