# Applying the Data Analysis Method to a Research Problem

# 1. Determine Research Objectives and Assess the Situation  <a class="anchor" id="Businessunderstanding"></a>
The first stage of the process is to understand what you want to accomplish from a research perspective. You may have competing objectives and constraints that must be properly balanced. The goal of this stage of the process is to uncover important factors that could influence the outcome of the project. Neglecting this step can mean that a great deal of effort is put into producing the right answers to the wrong questions.

## 1.1. Big Five Personality Test Question-Trait  Correlation <a class="anchor" id="Title"></a>

## 1.2. Introduction <a class="anchor" id="Introduction"></a> 
- This project is about the 2018 Big Five personality online test dataset.
- The Big Five personality theory is the most trackable and consistent personality theory used in research. The dataset I will be using was taken from a website called Truity:
https://www.truity.com/test/big-five-personality-test

- Unfortunately, I can only learn so much from the data I have been given, as the test does not ask for information such as jobs, hobbies, phobias, religion, ethnicity, physique, politics etc.

- Nor does it capture the test scores of the people the test takers surround themselves with, which might have shown whether friends, family members or partners tend to have similar or opposite scores.

- It is also limited to people who care to take the test to begin with, which will likely skew the data in favour of people who score higher in Openness, due to their greater interest in abstract topics. This is also suggested by the average Openness score of all users who took the test being above 50% (shown in the Describe Data section).

- Even if all of these were accounted for, there would be no way to confirm the validity of the information, so we are left with largely anonymous data.

- On Kaggle, plenty of data analysis has already been performed on this dataset, such as compiling test results and showing the average scores per country.

- I want to test the hypothesis that more people have above-average scores in Conscientiousness (C) together with below-average scores in Neuroticism (N) and Openness (O) than have the opposite profile. The aim is to find out whether a high score in (C) limits how high the scores in (N) and (O) can go.




- Explain how you are going to go about responding to the brief. 
- Include a brief outline of your method of enquiry. 
 

## 1.3. Terminology and Key Words <a class="anchor" id="Terminology"></a>
- The Big Five personality model divides personality characteristics into 5 categories: Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism, also known as OCEAN.

- Openness measures creativity, openness to new experience and interest in the abstract, and is associated with higher IQ scores.

- Conscientiousness measures work ethic, industriousness, rule following, preparation, and orderliness.

- Extraversion measures an individual's time spent interacting with the external world as opposed to the internal.

- Agreeableness measures cooperation, trust, empathy and kindness.

- Neuroticism measures an individual's tendency to feel and be affected by negative emotion.

- Sources:

https://www.verywellmind.com/the-big-five-personality-dimensions-2795422

https://www.psychologytoday.com/us/basics/big-5-personality-traits

https://psychcentral.com/lib/the-big-five-personality-traits

## 1.4. Background <a class="anchor" id="Background"></a>
- Dataset to find results per country:
https://www.kaggle.com/evgeniidorovskikh/big-5-personality-test-per-country-exploration
- Surveys publications (books, journals and sometimes conference papers) on work that has already been done on the topic of your report.  
- Introduce your review by explaining how you went about finding your materials, and any clear trends in research that have emerged.  
- Group your texts in themes. Write about each theme as a separate section, giving a critical summary of each piece of work, and showing its relevance to your research. 
- Conclude with how the review has informed your research (things you’ll be building on, gaps you’ll be filling etc.). 


## 1.5. Research Questions <a class="anchor" id="Research Question"></a>

The primary questions I am trying to answer are:


- Which questions that contribute to a specific trait's score are least likely to be ticked as '5' (most accurate)?

- Which questions that contribute to a specific trait's score are least likely to be ticked as '1' (least accurate)?


- Are there more people with above-average scores in Conscientiousness (C) together with below-average scores in Neuroticism (N) and Openness (O) than people with the opposite profile? The aim is to find out whether a high score in (C) limits how high the scores in (N) and (O) can go.

- Which questions are most frequently skipped?

## 1.6. Methodology/Methods <a class="anchor" id="Methodology/Methods"></a>


- State clearly how you carried out your investigation. 
- Explain why you chose this particular method. Is it based on the research in your background section? 
- Include techniques and any equipment you used, for example any Python libraries or external tools like Excel or screen scraping.
- If there were participants in your research, who were they? How many? How were they selected?  
- Write this section concisely but thoroughly.  
- You know what you did, but could a reader follow your description?   

# 2. Stage  Two - Data Understanding <a class="anchor" id="Dataunderstanding"></a>
The second stage of the process requires you to acquire the data listed in the project resources. This initial collection includes data loading, if this is necessary for data understanding. For example, if you use a specific tool for data understanding, it makes perfect sense to load your data into this tool. If you acquire multiple data sources then you need to consider how and when you're going to integrate these.

## 2.1 Initial Data Report <a class="anchor" id="Datareport"></a>
Initial data collection report - 
List the data sources acquired together with their locations, the methods used to acquire them and any problems encountered. Record problems you encountered and any resolutions achieved. This will help both with future replication of this project and with the execution of similar future projects.

In [170]:
# Import Libraries Required
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [171]:
#Data source: 
#Source Query location: 
path = '/content/gdrive/My Drive/data-final.csv'
# Read the tab-separated file; the first row supplies the column headers
df = pd.read_csv(path, sep='\t')


In [172]:
df

Unnamed: 0,EXT1,EXT2,EXT3,EXT4,EXT5,EXT6,EXT7,EXT8,EXT9,EXT10,EST1,EST2,EST3,EST4,EST5,EST6,EST7,EST8,EST9,EST10,AGR1,AGR2,AGR3,AGR4,AGR5,AGR6,AGR7,AGR8,AGR9,AGR10,CSN1,CSN2,CSN3,CSN4,CSN5,CSN6,CSN7,CSN8,CSN9,CSN10,...,AGR1_E,AGR2_E,AGR3_E,AGR4_E,AGR5_E,AGR6_E,AGR7_E,AGR8_E,AGR9_E,AGR10_E,CSN1_E,CSN2_E,CSN3_E,CSN4_E,CSN5_E,CSN6_E,CSN7_E,CSN8_E,CSN9_E,CSN10_E,OPN1_E,OPN2_E,OPN3_E,OPN4_E,OPN5_E,OPN6_E,OPN7_E,OPN8_E,OPN9_E,OPN10_E,dateload,screenw,screenh,introelapse,testelapse,endelapse,IPC,country,lat_appx_lots_of_err,long_appx_lots_of_err
0,4.0,1.0,5.0,2.0,5.0,1.0,5.0,2.0,4.0,1.0,1.0,4.0,4.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,5.0,2.0,4.0,2.0,3.0,2.0,4.0,3.0,4.0,3.0,4.0,3.0,2.0,2.0,4.0,4.0,2.0,4.0,4.0,...,4750.0,5475.0,11641.0,3115.0,3207.0,3260.0,10235.0,5897.0,1758.0,3081.0,6602.0,5457.0,1569.0,2129.0,3762.0,4420.0,9382.0,5286.0,4983.0,6339.0,3146.0,4067.0,2959.0,3411.0,2170.0,4920.0,4436.0,3116.0,2992.0,4354.0,2016-03-03 02:01:01,768.0,1024.0,9.0,234.0,6,1,GB,51.5448,0.1991
1,3.0,5.0,3.0,4.0,3.0,3.0,2.0,5.0,1.0,5.0,2.0,3.0,4.0,1.0,3.0,1.0,2.0,1.0,3.0,1.0,1.0,4.0,1.0,5.0,1.0,5.0,3.0,4.0,5.0,3.0,3.0,2.0,5.0,3.0,3.0,1.0,3.0,3.0,5.0,3.0,...,2158.0,2090.0,2143.0,2807.0,3422.0,5324.0,4494.0,3627.0,1850.0,1747.0,5163.0,5240.0,7208.0,2783.0,4103.0,3431.0,3347.0,2399.0,3360.0,5595.0,2624.0,4985.0,1684.0,3026.0,4742.0,3336.0,2718.0,3374.0,3096.0,3019.0,2016-03-03 02:01:20,1360.0,768.0,12.0,179.0,11,1,MY,3.1698,101.706
2,2.0,3.0,4.0,4.0,3.0,2.0,1.0,3.0,2.0,5.0,4.0,4.0,4.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,1.0,4.0,1.0,4.0,2.0,4.0,1.0,4.0,4.0,3.0,4.0,2.0,2.0,2.0,3.0,3.0,4.0,2.0,4.0,2.0,...,1089.0,2203.0,3386.0,1464.0,2562.0,1493.0,3067.0,13719.0,3892.0,4100.0,4286.0,4775.0,2713.0,2813.0,4237.0,6308.0,2690.0,1516.0,2379.0,2983.0,1930.0,1470.0,1644.0,1683.0,2229.0,8114.0,2043.0,6295.0,1585.0,2529.0,2016-03-03 02:01:56,1366.0,768.0,3.0,186.0,7,1,GB,54.9119,-1.3833
3,2.0,2.0,2.0,3.0,4.0,2.0,2.0,4.0,1.0,4.0,3.0,3.0,3.0,2.0,3.0,2.0,2.0,2.0,4.0,3.0,2.0,4.0,3.0,4.0,2.0,4.0,2.0,4.0,3.0,4.0,2.0,4.0,4.0,4.0,1.0,2.0,2.0,3.0,1.0,4.0,...,6062.0,11952.0,1040.0,2264.0,3664.0,3049.0,4912.0,7545.0,4632.0,6896.0,2824.0,520.0,2368.0,3225.0,2848.0,6264.0,3760.0,10472.0,3192.0,7704.0,3456.0,6665.0,1977.0,3728.0,4128.0,3776.0,2984.0,4192.0,3480.0,3257.0,2016-03-03 02:02:02,1920.0,1200.0,186.0,219.0,7,1,GB,51.75,-1.25
4,3.0,3.0,3.0,3.0,5.0,3.0,3.0,5.0,3.0,4.0,1.0,5.0,5.0,3.0,1.0,1.0,1.0,1.0,3.0,2.0,1.0,5.0,1.0,5.0,1.0,3.0,1.0,5.0,5.0,3.0,5.0,1.0,5.0,1.0,3.0,1.0,5.0,1.0,5.0,5.0,...,6771.0,2819.0,3682.0,2511.0,16204.0,1736.0,28983.0,1612.0,2437.0,4532.0,3843.0,7019.0,3102.0,3153.0,2869.0,6550.0,1811.0,3682.0,21500.0,20587.0,8458.0,3510.0,17042.0,7029.0,2327.0,5835.0,6846.0,5320.0,11401.0,8642.0,2016-03-03 02:02:57,1366.0,768.0,8.0,315.0,17,2,KE,1.0,38.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1015336,4.0,2.0,4.0,3.0,4.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,3.0,4.0,3.0,3.0,3.0,5.0,4.0,2.0,5.0,2.0,4.0,2.0,4.0,4.0,4.0,2.0,3.0,3.0,3.0,4.0,4.0,4.0,2.0,3.0,3.0,...,1655.0,1937.0,1233.0,3151.0,2576.0,1888.0,2815.0,2964.0,2665.0,2888.0,3008.0,2367.0,2504.0,2544.0,2144.0,4784.0,3529.0,5072.0,2016.0,3353.0,2649.0,3544.0,7577.0,3096.0,1896.0,3912.0,2744.0,2025.0,1873.0,1232.0,2018-11-08 12:04:58,1920.0,1080.0,3.0,160.0,10,2,US,39.9883,-75.2208
1015337,4.0,3.0,4.0,3.0,3.0,3.0,4.0,4.0,3.0,3.0,4.0,3.0,5.0,1.0,5.0,5.0,4.0,4.0,4.0,5.0,2.0,4.0,1.0,4.0,3.0,5.0,3.0,3.0,5.0,3.0,3.0,2.0,3.0,4.0,3.0,3.0,2.0,3.0,2.0,3.0,...,2422.0,1448.0,3216.0,6160.0,2208.0,1513.0,2785.0,3833.0,3280.0,1184.0,2096.0,1880.0,3209.0,1744.0,4392.0,1943.0,2263.0,1559.0,1304.0,2176.0,2560.0,6632.0,2312.0,2376.0,2969.0,2271.0,4064.0,1144.0,2936.0,1615.0,2018-11-08 12:07:18,1920.0,1080.0,3.0,122.0,7,1,US,38.0,-97.0
1015338,4.0,2.0,4.0,3.0,5.0,1.0,4.0,2.0,4.0,4.0,3.0,2.0,4.0,3.0,2.0,2.0,4.0,2.0,4.0,1.0,3.0,5.0,5.0,3.0,2.0,3.0,2.0,4.0,3.0,5.0,4.0,5.0,3.0,5.0,1.0,5.0,1.0,4.0,1.0,4.0,...,2487.0,1863.0,1745.0,4040.0,4068.0,1480.0,4550.0,3000.0,5873.0,2088.0,5286.0,2232.0,2942.0,2296.0,1841.0,2303.0,1791.0,2744.0,1196.0,4719.0,2121.0,2807.0,1711.0,2335.0,1609.0,3007.0,2727.0,2648.0,2646.0,1287.0,2018-11-08 12:07:49,1920.0,1080.0,2.0,135.0,12,6,US,36.1473,-86.777
1015339,2.0,4.0,3.0,4.0,2.0,2.0,1.0,4.0,2.0,4.0,4.0,3.0,4.0,2.0,4.0,4.0,2.0,2.0,4.0,4.0,2.0,3.0,2.0,4.0,3.0,4.0,2.0,4.0,4.0,3.0,4.0,2.0,4.0,2.0,2.0,2.0,4.0,2.0,4.0,4.0,...,2982.0,5584.0,2567.0,2168.0,6320.0,3055.0,2580.0,2816.0,2544.0,3744.0,5168.0,3903.0,37726.0,2735.0,1367.0,5056.0,3216.0,3320.0,2263.0,1415.0,5024.0,4664.0,4792.0,6471.0,1873.0,3136.0,3129.0,2799.0,7184.0,2526.0,2018-11-08 12:08:34,1920.0,1080.0,6.0,212.0,8,1,US,34.1067,-117.8067


## 2.2 Describe Data <a class="anchor" id="Describedata"></a>
Data description report - Describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Evaluate whether the data acquired satisfies your requirements.

In [None]:
df.columns

Index(['EXT1', 'EXT2', 'EXT3', 'EXT4', 'EXT5', 'EXT6', 'EXT7', 'EXT8', 'EXT9',
       'EXT10',
       ...
       'dateload', 'screenw', 'screenh', 'introelapse', 'testelapse',
       'endelapse', 'IPC', 'country', 'lat_appx_lots_of_err',
       'long_appx_lots_of_err'],
      dtype='object', length=110)

In [None]:
df.shape

(1015341, 110)

In [None]:
df.dtypes

EXT1                     float64
EXT2                     float64
EXT3                     float64
EXT4                     float64
EXT5                     float64
                          ...   
endelapse                  int64
IPC                        int64
country                   object
lat_appx_lots_of_err      object
long_appx_lots_of_err     object
Length: 110, dtype: object

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1015341 entries, 0 to 1015340
Columns: 110 entries, EXT1 to long_appx_lots_of_err
dtypes: float64(104), int64(2), object(4)
memory usage: 852.1+ MB


## 2.3 Verify Data Quality <a class="anchor" id="Verifydataquality"></a>

Examine the quality of the data, addressing questions such as:

- Some questions are answered as neutral (3), which does not move the trait's score in either direction. The test discourages this option, as a neutral answer is assumed to mean that the individual does not understand the question or simply cannot relate to it.

- Is the data complete (does it cover all the cases required)?
- Is it correct, or does it contain errors and, if there are errors, how common are they?
- Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?

### 2.3.1. Missing Data <a class="anchor" id="MissingData"></a>
In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and have to be either filled in or removed before we train a machine learning model. First, let’s get a sense of how many missing values are in each column.

While we always want to be careful about removing information, if a column has a high percentage of missing values, then it probably will not be useful to our model. The threshold for removing columns should depend on the problem.

- Here is where I will find the top three questions most frequently answered as '3'.

In [173]:
df.isnull().sum()

EXT1                     1783
EXT2                     1783
EXT3                     1783
EXT4                     1783
EXT5                     1783
                         ... 
endelapse                   0
IPC                         0
country                    77
lat_appx_lots_of_err        0
long_appx_lots_of_err       0
Length: 110, dtype: int64

In [174]:
def missing_values_table(df):
    # Count missing values per column and express them as a percentage of all rows
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat(
        [mis_val, mis_val_percent], axis=1,
        keys=['Missing Values', '% of Total Values'])
    # Keep only the columns that have missing values, largest share first
    mis_val_table = mis_val_table[
        mis_val_table['Missing Values'] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table.shape[0]) +
          " columns that have missing values.")
    return mis_val_table

In [175]:
missing_values_table(df)

Your selected dataframe has 110 columns.
There are 105 columns that have missing values.


| | Missing Values | % of Total Values |
|---|---|---|
| introelapse | 2066 | 0.2 |
| screenh | 2066 | 0.2 |
| screenw | 2066 | 0.2 |
| EXT1 | 1783 | 0.2 |
| EST6_E | 1783 | 0.2 |
| ... | ... | ... |
| CSN3 | 1783 | 0.2 |
| CSN2 | 1783 | 0.2 |
| CSN1 | 1783 | 0.2 |
| AGR10 | 1783 | 0.2 |


In [176]:
# Get the columns with > 50% missing
missing_df = missing_values_table(df);
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('We will remove %d columns.' % len(missing_columns))

Your selected dataframe has 110 columns.
There are 105 columns that have missing values.
We will remove 0 columns.


In [177]:
# Drop the columns (columns= is required; a bare list would be treated as row labels)
df = df.drop(columns=missing_columns)

In [178]:
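neutral = df[df.columns[0:50]]
neutral[neutral == 3].dropna().size
# dropna() keeps only the rows in which all 50 answers are 3, and .size is rows multiplied by columns,
# so 34150 corresponds to 683 respondents (34150 / 50) who answered neutral to every one of the
# first 50 questions. It is not the total number of neutral answers.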

34150
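
The total number of neutral answers, and the three questions that attract them most often, can be counted column by column. The following is a minimal sketch on a small made-up frame; `toy` and `neutral_counts` are illustrative names that are not part of the notebook. On the real data, `answers` would be the first 50 columns of `df`.

```python
import pandas as pd


def neutral_counts(answers: pd.DataFrame) -> pd.Series:
    """Number of '3' (neutral) answers per question, most frequent first."""
    # Comparing with 3 gives a True/False frame; missing answers compare as False
    return (answers == 3).sum().sort_values(ascending=False)


# Toy answers from four made-up respondents (not taken from the dataset)
toy = pd.DataFrame({
    "EXT1": [3, 3, 1, 5],
    "EXT2": [3, 2, 4, 5],
    "EST1": [3, 3, 3, 1],
})

counts = neutral_counts(toy)
top_three = counts.head(3)                             # questions most often answered '3'
total_neutral = int(counts.sum())                      # every neutral answer in the frame
all_neutral_rows = int((toy == 3).all(axis=1).sum())   # respondents who only answered '3'

print(top_three)
print(total_neutral, all_neutral_rows)
```

Keeping `total_neutral` and `all_neutral_rows` separate makes it clear whether a figure describes individual answers or whole respondents.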

### 2.3.2. Outliers <a class="anchor" id="Outliers"></a>
At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

- Below the first quartile − 3 ∗ interquartile range
- Above the third quartile + 3 ∗ interquartile range
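
The two rules above can be turned into a boolean mask. This is a minimal sketch rather than the method used in this notebook: the helper name `extreme_outlier_mask` and the toy values are made up, and applying it to a timing column such as `testelapse` is my assumption about where outliers are most likely to matter.

```python
import pandas as pd


def extreme_outlier_mask(values: pd.Series, k: float = 3.0) -> pd.Series:
    """True where a value lies more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy completion times in seconds (made up, not taken from the dataset)
toy_elapsed = pd.Series([120, 150, 160, 180, 200, 210, 230, 5000])

mask = extreme_outlier_mask(toy_elapsed)
cleaned = toy_elapsed[~mask]   # keep only the rows that are not extreme outliers

print(int(mask.sum()), "extreme outlier(s) removed")
```

With the real data the same mask could be used to filter rows, for example `df[~extreme_outlier_mask(df['testelapse'])]`.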

## 2.4 Initial Data Exploration  <a class="anchor" id="Exploredata"></a>
During this stage you'll address data mining questions using querying, data visualization and reporting techniques. These ***may*** include:

- **Distribution** of key attributes (for example, the target attribute of a prediction task)
- **Relationships** between pairs or small numbers of attributes
- Results of **simple aggregations**
- **Properties** of significant sub-populations
- **Simple** statistical analyses

These analyses may directly address your data mining goals. They may also contribute to or refine the data description and quality reports, and feed into the transformation and other data preparation steps needed for further analysis. 

- **Data exploration report** - Describe results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate you could include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.


### 2.4.1 Distributions  <a class="anchor" id="Distributions"></a>

In [None]:
def count_values_table(series):
    # Count each distinct value and express it as a percentage of all rows
    count_val = series.value_counts()
    count_val_percent = (100 * count_val / len(series)).round(1)
    # keys= names the two columns directly, whatever the input Series is called
    return pd.concat([count_val, count_val_percent], axis=1,
                     keys=['Count Values', '% of Total Values'])

In [None]:
# Histogram
def hist_chart(df, col):
        plt.style.use('fivethirtyeight')
        plt.hist(df[col].dropna(), edgecolor = 'k');
        plt.xlabel(col); plt.ylabel('Number of Entries'); 
        plt.title('Distribution of '+col);

In [None]:
col = 'EXT1'
# Histogram & Results
hist_chart(df, col)
count_values_table(df[col])

### 2.4.2 Correlations  <a class="anchor" id="Correlations"></a>
Can we derive any correlations from this dataset? A pairplot shows correlations, distributions and regression lines.
Correlograms are well suited to exploratory analysis because they let you quickly observe the relationship between every pair of variables in your matrix.
With seaborn this only requires a call to the pairplot function.

Pairplot documentation can be found here: https://seaborn.pydata.org/generated/seaborn.pairplot.html
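
A pairplot over all of the question columns and more than a million rows would be slow to draw, so a plain correlation matrix may be a lighter first step. The following is a minimal sketch on made-up answers; `toy` is an illustrative frame and not part of the dataset.

```python
import pandas as pd

# Toy answers to three questions from five made-up respondents
toy = pd.DataFrame({
    "EXT1": [1, 2, 3, 4, 5],
    "EXT2": [5, 4, 3, 2, 1],   # exact mirror image of EXT1
    "EXT3": [1, 3, 2, 5, 4],
})

# Pairwise Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))
```

A matrix like this can be passed to `sns.heatmap(corr)` for a visual overview, and only the most interesting columns then need to go into a pairplot.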

In [None]:
#Seaborn allows to make a correlogram or correlation matrix really easily. 
#sns.pairplot(df.dropna().drop(['x'], axis=1), hue='y', kind ='reg')

#plt.show()


In [None]:
# Placeholder: 'x' and 'y' must be replaced with real column names before this can run
#df_agg = df.drop(['x'], axis=1).groupby(['y']).sum()
#df_agg = df.groupby(['y']).sum()

## 2.5 Data Quality Report <a class="anchor" id="Dataqualityreport"></a>
List the results of the data quality verification. If quality problems exist, suggest possible solutions. Solutions to data quality problems generally depend heavily on both data and business knowledge.

# 3. Stage Three - Data Preparation <a class="anchor" id="Datapreperation"></a>
This is the stage of the project where you decide on the data that you're going to use for analysis. The criteria you might use to make this decision include the relevance of the data to your data mining goals, the quality of the data, and also technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.

## 3.1 Select Your Data <a class="anchor" id="Selectyourdata"></a>

Rationale for inclusion/exclusion - List the data to be included/excluded and the reasons for these decisions.

In [None]:
X_train_regr = df.drop(['date_maint', 'account_open_date'], axis = 1)
X_train = df.drop(['target', 'date_maint', 'account_open_date'], axis = 1)
X_test = test.drop(['date_maint', 'account_open_date'], axis = 1)

## 3.2 Clean The Data <a class="anchor" id="Cleansethedata"></a>
This task involves raising the data quality to the level required by the analysis techniques that you've selected. This may involve selecting clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modelling.

### 3.2.1 Label Encoding <a class="anchor" id="labelEncoding"></a>
Label Encoding to turn Categorical values to Integers

An approach to encoding categorical values is to use a technique called label encoding. Label encoding is simply converting each value in a column to a number. For example, the body_style column contains 5 different values. We could choose to encode it like this:

- convertible -> 0
- hardtop -> 1
- hatchback -> 2
- sedan -> 3
- wagon -> 4

Source: http://pbpython.com/categorical-encoding.html

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
for col in CAT_COLS:
        encoder = LabelEncoder()
        X_train[col] = encoder.fit_transform(X_train[col].astype(str))
        X_test[col] = encoder.transform(X_test[col].astype(str))

In [None]:
df["column"] = df["column"].astype('category')
df.dtypes

In [None]:
df["column"] = df["column"].cat.codes
df.head()

### 3.2.2 Drop Unnecessary Columns <a class="anchor" id="DropCols"></a>
Sometimes we may not need certain columns. We can drop them to keep only relevant data.

In [None]:
del_col_list = ['col1', 'col2']

df = df.drop(del_col_list, axis=1)
df.head()

### 3.2.3 Altering Data Types <a class="anchor" id="AlteringDatatypes"></a>
Sometimes we may need to alter data types. Including to/from object datatypes

In [None]:
#df['date'] = pd.to_datetime(df['date'])

### 3.2.4 Dealing With Zeros <a class="anchor" id="DealingZeros"></a>
Replacing all the zeros from cols. **Note** You may not want to do this - add / remove as required

In [None]:
#cols = ['col1', 'col2']
#df[cols] = df[cols].replace(0, np.nan)

In [None]:
# dropping all the rows with na in the columns mentioned above in the list.

# df.dropna(subset=cols, inplace=True)


### 3.2.5 Dealing With Duplicates <a class="anchor" id="DealingDuplicates"></a>
Remove duplicate rows. **Note** You may not want to do this - add / remove as required

In [None]:
#df = df.drop_duplicates(keep='first')

## 3.3 Construct Required Data   <a class="anchor" id="Constructrequireddata"></a>
This task includes constructive data preparation operations such as the production of derived attributes or entire new records, or transformed values for existing attributes.

**Derived attributes** - These are new attributes that are constructed from one or more existing attributes in the same record, for example you might use the variables of length and width to calculate a new variable of area.

**Generated records** - Here you describe the creation of any completely new records. For example you might need to create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modelling purposes it might make sense to explicitly represent the fact that particular customers made zero purchases.

- Here is where I will find, for each trait, the questions that are least likely to be answered as '5' and the questions that are least likely to be answered as '1'.
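
The counts of '5' and '1' answers per question can be obtained without an explicit loop by comparing a whole block of columns at once. This is a minimal sketch on a made-up frame: the function name `extreme_answer_summary` and the toy answers are illustrative, and it relies on the question columns being named with a trait prefix and a number (`EXT1` to `EXT10`, `EST1` to `EST10` and so on), as in the column listing of this dataset.

```python
import pandas as pd


def extreme_answer_summary(df: pd.DataFrame, prefix: str, n_items: int = 10) -> dict:
    """Find the questions of one trait that are least often answered 5 and 1."""
    cols = [f"{prefix}{i}" for i in range(1, n_items + 1)]
    fives = (df[cols] == 5).sum()   # number of '5' answers per question
    ones = (df[cols] == 1).sum()    # number of '1' answers per question
    return {
        "least_5": (fives.idxmin(), int(fives.min())),
        "least_1": (ones.idxmin(), int(ones.min())),
    }


# Toy answers from four made-up respondents, three questions only
toy = pd.DataFrame({
    "EXT1": [5, 5, 1, 3],
    "EXT2": [5, 1, 1, 2],
    "EXT3": [1, 1, 1, 1],
})

summary = extreme_answer_summary(toy, "EXT", n_items=3)
print(summary)
```

On the real data the same call could be repeated for each prefix, for example `extreme_answer_summary(df, 'CSN')`. If two questions tie, `idxmin` reports the first one.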


## 3.4 Integrate Data  <a class="anchor" id="Integratedata"></a>
These are methods whereby information is combined from multiple databases, tables or records to create new records or values.

**Merged data** - Merging tables refers to joining together two or more tables that have different information about the same objects. For example a retail chain might have one table with information about each store’s general characteristics (e.g., floor space, type of mall), another table with summarised sales data (e.g., profit, percent change in sales from previous year), and another with information about the demographics of the surrounding area. Each of these tables contains one record for each store. These tables can be merged together into a new table with one record for each store, combining fields from the source tables.

**Aggregations** - Aggregations refers to operations in which new values are computed by summarising information from multiple records and/or tables. For example, converting a table of customer purchases where there is one record for each purchase into a new table where there is one record for each customer, with fields such as number of purchases, average purchase amount, percent of orders charged to credit card, percent of items under promotion etc.


## 3.5 Primary Data Set  <a class="anchor" id="Primary Data Set"></a>
Construct Our Primary Data Set, this is the pre-processed data set that will be used for the data modeling experiments.

# 4. Modelling <a class="anchor" id="Modelling"></a>
As the first step in modelling, you'll select the actual modelling technique that you'll be using. Although you may have already selected a tool during the business understanding phase, at this stage you'll be selecting the specific modelling technique, e.g. Association Rules with Apriori, decision-tree building with C5.0, Clustering with K-Means or neural network generation with back propagation. If multiple techniques are applied, perform this task separately for each technique.



## 4.1. Modelling technique <a class="anchor" id="ModellingTechnique"></a>


In [None]:
highest_ext = 0
lowest_ext = 10000000
test = 0
highest_overall = 0
lowest_overall = 10000000

for x in range(1,11):
  print('EXT' + str(x), 'has:', int(df[df['EXT' + str(x)]==5.0].loc[:,'EXT' + str(x)].value_counts()), '\n')
  test = df[df['EXT' + str(x)]==5.0].loc[:,'EXT' + str(x)].value_counts()

  if float(test) > float(highest_ext):
    highest_ext = test
  
  if float(test) < float(lowest_ext):
    lowest_ext = test

if float(highest_ext) > float(highest_overall):
  highest_overall = highest_ext

if float(lowest_ext) < float(lowest_overall):
  lowest_overall = lowest_ext


print('\nThe most frequently selected 5.0 question in EXT is:',highest_ext)
print('\nThe most frequently selected 5.0 question overall is:',highest_overall)

print('\nThe least frequently selected 5.0 question in EXT is:',lowest_ext)
print('\nThe least frequently selected 5.0 question overall is:',lowest_overall)
# This loop goes through the 10 questions that measure extraversion and counts how many people answered 5.0 to each.

# Of the ten questions that measure extraversion, EXT6 was the least likely to be answered 5.0 (71662 people selected it).
# EXT10 was the most frequently answered 5.0 (312925 people selected it).

EXT1 has: 80702 

EXT2 has: 126928 

EXT3 has: 185771 

EXT4 has: 159140 

EXT5 has: 193134 

EXT6 has: 71662 

EXT7 has: 148595 

EXT8 has: 248347 

EXT9 has: 154759 

EXT10 has: 312925 


The most frequently selected 5.0 question in EXT is: 5.0    312925
Name: EXT10, dtype: int64

The most frequently selected 5.0 question overall is: 5.0    312925
Name: EXT10, dtype: int64

The least frequently selected 5.0 question in EXT is: 5.0    71662
Name: EXT6, dtype: int64

The least frequently selected 5.0 question overall is: 5.0    71662
Name: EXT6, dtype: int64


In [None]:
lowest_est = 10000000
highest_est = 0
test = 0

for x in range(1,11):
  print('EST' + str(x), 'has:', int(df[df['EST' + str(x)]==5.0].loc[:,'EST' + str(x)].value_counts()), '\n')
  test = df[df['EST' + str(x)]==5.0].loc[:,'EST' + str(x)].value_counts()

  if float(test) > float(highest_est):
    highest_est = test

  if float(test) < float(lowest_est):
    lowest_est = test

if float(highest_est) > float(highest_overall):
  highest_overall = highest_est

if float(lowest_est) < float(lowest_overall):
  lowest_overall = lowest_est

print('\nThe most frequently selected question in EST is:',highest_est)
print('\nThe most frequently selected question overall is:',highest_overall)

print('\nThe least frequently selected 5.0 question in EST is:',lowest_est)
print('\nThe least frequently selected 5.0 question overall is:',lowest_overall)

# This loop goes through the 10 questions that measure neuroticism and counts how many people answered 5.0 to each.

# Of the ten questions that measure neuroticism, EST4 was the least likely to be answered 5.0
# (95145 people selected it). EST3 was the most frequently answered 5.0 (352309 people selected it).

EST1 has: 230791 

EST2 has: 155249 

EST3 has: 352309 

EST4 has: 95145 

EST5 has: 111384 

EST6 has: 128476 

EST7 has: 160678 

EST8 has: 118042 

EST9 has: 156380 

EST10 has: 122124 


The most frequently selected question in EST is: 5.0    352309
Name: EST3, dtype: int64

The most frequently selected question overall is: 5.0    352309
Name: EST3, dtype: int64

The least frequently selected 5.0 question in EST is: 5.0    95145
Name: EST4, dtype: int64

The least frequently selected 5.0 question overall is: 5.0    71662
Name: EXT6, dtype: int64


In [None]:
lowest_agr = 10000000
highest_agr = 0
test = 0

for x in range(1,11):
  print('AGR' + str(x), 'has:', int(df[df['AGR' + str(x)]==5.0].loc[:,'AGR' + str(x)].value_counts()), '\n')
  test = df[df['AGR' + str(x)]==5.0].loc[:,'AGR' + str(x)].value_counts()

  if float(test) > float(highest_agr):
    highest_agr = test

  if float(test) < float(lowest_agr):
    lowest_agr = test

if float(highest_agr) > float(highest_overall):
  highest_overall = highest_agr

if float(lowest_agr) < float(lowest_overall):
  lowest_overall = lowest_agr

 
print('\nThe most frequently selected question in AGR is:',highest_agr)
print('\nThe most frequently selected question overall is:',highest_overall)

print('\nThe least frequently selected 5.0 question in AGR is:',lowest_agr)
print('\nThe least frequently selected 5.0 question overall is:',lowest_overall)

# This loop goes through the 10 questions that measure agreeableness and counts how many people answered 5.0 to each.

# Of the ten questions that measure agreeableness, AGR7 was the least likely to be answered 5.0
# (40458 people selected it). AGR4 was the most frequently answered 5.0 (374645 people selected it).

AGR1 has: 89210 

AGR2 has: 337779 

AGR3 has: 60090 

AGR4 has: 374645 

AGR5 has: 55482 

AGR6 has: 333065 

AGR7 has: 40458 

AGR8 has: 244392 

AGR9 has: 320483 

AGR10 has: 220879 


The most frequently selected question in AGR is: 5.0    374645
Name: AGR4, dtype: int64

The most frequently selected question overall is: 5.0    374645
Name: AGR4, dtype: int64

The least frequently selected 5.0 question in AGR is: 5.0    40458
Name: AGR7, dtype: int64

The least frequently selected 5.0 question overall is: 5.0    40458
Name: AGR7, dtype: int64


In [None]:
lowest_csn = 1000000
highest_csn = 0
test = 0

for x in range(1,11):
  print('CSN' + str(x), 'has:', df[df['CSN' + str(x)]==5.0].loc[:,'CSN' + str(x)].value_counts(), '\n')
  test = df[df['CSN' + str(x)]==5.0].loc[:,'CSN' + str(x)].value_counts()

  if float(test) > float(highest_csn):
    highest_csn = test

  if float(test) < float(lowest_csn):
    lowest_csn = test

if float(highest_csn) > float(highest_overall):
  highest_overall = highest_csn

if float(lowest_csn) < float(lowest_overall):
  lowest_overall = lowest_csn

 
print('\nThe most frequently selected question in CSN is:',highest_csn)
print('\nThe most frequently selected question overall is:',highest_overall)

print('\nThe least frequently selected 5.0 question in CSN is:',lowest_csn)
print('\nThe least frequently selected 5.0 question overall is:',lowest_overall)

# This loop goes through the 10 questions that measure conscientiousness and counts how many people answered 5.0 to each.

# Of the ten conscientiousness questions, CSN8 was the least often answered 5.0 (47113 people)
# and CSN3 was the most often answered 5.0 (370372 people).

CSN1 has: 5.0    149803
Name: CSN1, dtype: int64 

CSN2 has: 5.0    159719
Name: CSN2, dtype: int64 

CSN3 has: 5.0    370372
Name: CSN3, dtype: int64 

CSN4 has: 5.0    85129
Name: CSN4, dtype: int64 

CSN5 has: 5.0    95012
Name: CSN5, dtype: int64 

CSN6 has: 5.0    162753
Name: CSN6, dtype: int64 

CSN7 has: 5.0    270068
Name: CSN7, dtype: int64 

CSN8 has: 5.0    47113
Name: CSN8, dtype: int64 

CSN9 has: 5.0    171354
Name: CSN9, dtype: int64 

CSN10 has: 5.0    210359
Name: CSN10, dtype: int64 


The most frequently selected question in CSN is: 5.0    370372
Name: CSN3, dtype: int64

The most frequently selected question overall is: 5.0    374645
Name: AGR4, dtype: int64

The least frequently selected 5.0 question in CSN is: 5.0    47113
Name: CSN8, dtype: int64

The least frequently selected 5.0 question overall is: 5.0    40458
Name: AGR7, dtype: int64


In [None]:
lowest_opn = 1000000
highest_opn = 0
test = 0

for x in range(1,11):
  print('OPN' + str(x), 'has:', df[df['OPN' + str(x)]==5.0].loc[:,'OPN' + str(x)].value_counts(), '\n')
  test = df[df['OPN' + str(x)]==5.0].loc[:,'OPN' + str(x)].value_counts()

  if float(test) > float(highest_opn):
    highest_opn = test

  if float(test) < float(lowest_opn):
    lowest_opn = test

if float(highest_opn) > float(highest_overall):
  highest_overall = highest_opn

if float(lowest_opn) < float(lowest_overall):
  lowest_overall = lowest_opn

 
print('\nThe most frequently selected question in OPN is:',highest_opn)
print('\nThe most frequently selected question overall is:',highest_overall)

print('\nThe least frequently selected 5.0 question in OPN is:',lowest_opn)
print('\nThe least frequently selected 5.0 question overall is:',lowest_overall)
  
# This loop goes through the 10 questions that measure openness and counts how many people answered 5.0 to each.

# Of the ten openness questions, OPN4 was the least often answered 5.0 (33235 people)
# and OPN9 was the most often answered 5.0 (456735 people).

OPN1 has: 5.0    274687
Name: OPN1, dtype: int64 

OPN2 has: 5.0    35057
Name: OPN2, dtype: int64 

OPN3 has: 5.0    421047
Name: OPN3, dtype: int64 

OPN4 has: 5.0    33235
Name: OPN4, dtype: int64 

OPN5 has: 5.0    261517
Name: OPN5, dtype: int64 

OPN6 has: 5.0    37078
Name: OPN6, dtype: int64 

OPN7 has: 5.0    350180
Name: OPN7, dtype: int64 

OPN8 has: 5.0    164467
Name: OPN8, dtype: int64 

OPN9 has: 5.0    456735
Name: OPN9, dtype: int64 

OPN10 has: 5.0    371706
Name: OPN10, dtype: int64 


The most frequently selected question in OPN is: 5.0    456735
Name: OPN9, dtype: int64

The most frequently selected question overall is: 5.0    456735
Name: OPN9, dtype: int64

The least frequently selected 5.0 question in OPN is: 5.0    33235
Name: OPN4, dtype: int64

The least frequently selected 5.0 question overall is: 5.0    33235
Name: OPN4, dtype: int64
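
The per-trait loops above can also be collapsed into a single vectorised pass. The following is a minimal sketch, assuming `df` holds the 50 answer columns named `EXT1` to `OPN10` as in the cells above. The helper name `count_fives` and the small `demo` frame are illustrative only and are not part of the original notebook.

```python
import pandas as pd

TRAITS = ("EXT", "EST", "AGR", "CSN", "OPN")

def count_fives(frame: pd.DataFrame) -> pd.Series:
    """Count, per question column, how many respondents answered 5.0."""
    cols = [f"{trait}{i}" for trait in TRAITS for i in range(1, 11)]
    # Comparing the whole block at once gives a boolean frame; summing counts the True values.
    return (frame[cols] == 5.0).sum()

# Made-up answers for three respondents, only to show the call (not real data).
demo = pd.DataFrame(
    {f"{trait}{i}": [5.0, 3.0, 5.0] for trait in TRAITS for i in range(1, 11)}
)
demo["OPN4"] = [1.0, 2.0, 5.0]
demo["OPN9"] = [5.0, 5.0, 5.0]

counts = count_fives(demo)
agr_counts = counts[[f"AGR{i}" for i in range(1, 11)]]  # counts for one trait only

print("Most often answered 5.0:", counts.idxmax(), counts.max())
print("Least often answered 5.0:", counts.idxmin(), counts.min())
```

On the real data, `count_fives(df)` would return all 50 counts at once, so the per-trait and overall extremes come from `idxmax()` and `idxmin()` without carrying running variables between cells.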


In [179]:
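# Take the first 1/200 of the rows as a working subset.
fraction_df = len(df) // 200
# .copy() makes the subset independent of df, so columns can be added to it later.
df_fraction = df.iloc[:fraction_df].copy()
print(df_fraction)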

      EXT1  EXT2  EXT3  ...  country  lat_appx_lots_of_err  long_appx_lots_of_err
0      4.0   1.0   5.0  ...       GB               51.5448                 0.1991
1      3.0   5.0   3.0  ...       MY                3.1698                101.706
2      2.0   3.0   4.0  ...       GB               54.9119                -1.3833
3      2.0   2.0   2.0  ...       GB                 51.75                  -1.25
4      3.0   3.0   3.0  ...       KE                   1.0                   38.0
...    ...   ...   ...  ...      ...                   ...                    ...
5071   1.0   5.0   5.0  ...       US               38.2552               -85.5459
5072   1.0   4.0   2.0  ...       US               42.9773               -87.8941
5073   4.0   1.0   4.0  ...       US               27.0775               -80.2587
5074   3.0   1.0   5.0  ...       MY                3.1698               101.7026
5075   1.0   4.0   3.0  ...       CA               44.0414               -79.4534

[5076 rows x 11

In [285]:
# Items whose answer is added as given for each trait; the remaining items are reverse-scored.
direct_items = {
  'EXT': [1, 3, 5, 7, 9],
  'EST': [1, 3, 5, 6, 7, 8, 9, 10],
  'AGR': [2, 4, 6, 8, 9, 10],
  'CSN': [1, 3, 5, 7, 9, 10],
  'OPN': [1, 3, 5, 7, 8, 9, 10],
}

# Reverse scoring swaps 1 with 5 and 2 with 4; any other answer counts as the neutral 3.
reverse_map = {1: 5, 2: 4, 4: 2, 5: 1}

def trait_scores(prefix):
  # Returns one total per respondent in df_fraction for the trait with this column prefix.
  totals = []
  for x in range(len(df_fraction)):
    score = 0
    for y in range(1, 11):
      answer = df_fraction[prefix + str(y)].iloc[x]
      if y in direct_items[prefix]:
        score += answer
      else:
        score += reverse_map.get(answer, 3)
    totals.append(int(score))
  return totals

total_extraversion = trait_scores('EXT')
total_neuroticism = trait_scores('EST')
total_agreeableness = trait_scores('AGR')
total_conscientiousness = trait_scores('CSN')
total_oppeness = trait_scores('OPN')
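
The row-by-row scoring above can be expressed without explicit loops over respondents, because reversing a 1 to 5 answer is the same as computing `6 - answer`. The sketch below is a hedged alternative, not the notebook's own method. It assumes every answer is between 1 and 5 (the code above instead scores any other value on a reversed item as 3), and the names `REVERSED`, `trait_totals` and `demo` are hypothetical. The reversed items listed are the ones the scoring code above treats as reversed.

```python
import pandas as pd

# Reverse-scored item numbers per trait, taken from the scoring code above.
REVERSED = {
    "EXT": [2, 4, 6, 8, 10],
    "EST": [2, 4],
    "AGR": [1, 3, 5, 7],
    "CSN": [2, 4, 6, 8],
    "OPN": [2, 4, 6],
}

def trait_totals(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one column of summed scores per trait (assumes answers are 1 to 5)."""
    totals = {}
    for prefix, reversed_items in REVERSED.items():
        cols = [f"{prefix}{i}" for i in range(1, 11)]
        answers = frame[cols].copy()
        rev_cols = [f"{prefix}{i}" for i in reversed_items]
        # 6 - answer maps 1->5, 2->4, 3->3, 4->2, 5->1 in one step.
        answers[rev_cols] = 6 - answers[rev_cols]
        totals[prefix] = answers.sum(axis=1)
    return pd.DataFrame(totals)

# One-row example: EXT1..EXT10 of respondent 0 as shown in the notebook output;
# all other traits are filled with a neutral 3.0 placeholder (not real answers).
row0_ext = [4, 1, 5, 2, 5, 1, 5, 2, 4, 1]
demo = pd.DataFrame({f"{p}{i}": [3.0] for p in REVERSED for i in range(1, 11)})
for i, value in enumerate(row0_ext, start=1):
    demo[f"EXT{i}"] = float(value)

scores = trait_totals(demo)
print(scores)  # the EXT total is 46, matching respondent 0's Extraversion score
```

Because the arithmetic runs on whole columns, this approach should scale to the full dataset rather than a 1/200 subset, provided out-of-range answers are handled first.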




In [286]:
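# Add the five trait totals as the first columns of df_fraction.
# Each total sums ten answers scored 1 to 5, so the possible range is 10 to 50;
# the 'Min=20' in the labels understates it (totals such as 13 and 19 appear in the output).
# allow_duplicates=True lets this cell be re-run, but every re-run adds another copy of the five columns.
df_fraction.insert(loc=0, column='Extraversion Scores (Min=20, Max=50)', value=total_extraversion, allow_duplicates=True)
df_fraction.insert(loc=0, column='Neuroticism Scores (Min=20, Max=50)', value=total_neuroticism, allow_duplicates=True)
df_fraction.insert(loc=0, column='Agreeableness Scores (Min=20, Max=50)', value=total_agreeableness, allow_duplicates=True)
df_fraction.insert(loc=0, column='Conscientiousness Scores (Min=20, Max=50)', value=total_conscientiousness, allow_duplicates=True)
df_fraction.insert(loc=0, column='Oppeness Scores (Min=20, Max=50)', value=total_oppeness, allow_duplicates=True)
df_fraction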

Unnamed: 0,"Oppeness Scores (Min=20, Max=50)","Conscientiousness Scores (Min=20, Max=50)","Agreeableness Scores (Min=20, Max=50)","Neuroticism Scores (Min=20, Max=50)","Extraversion Scores (Min=20, Max=50)","Oppeness Scores (Min=20, Max=50).1","Conscientiousness Scores (Min=20, Max=50).1","Agreeableness Scores (Min=20, Max=50).1","Neuroticism Scores (Min=20, Max=50).1","Extraversion Scores (Min=20, Max=50).1","Oppeness Scores (Min=20, Max=50).2","Conscientiousness Scores (Min=20, Max=50).2","Agreeableness Scores (Min=20, Max=50).2","Neuroticism Scores (Min=20, Max=50).2","Extraversion Scores (Min=20, Max=50).2","Conscientiousness Scores (Min=20, Max=50).3","Agreeableness Scores (Min=20, Max=50).3","Neuroticism Scores (Min=20, Max=50).3","Extraversion Scores (Min=20, Max=50).3",Agreeableness Scores,Neuroticism Scores,Extraversion Scores,EXT1,Neuroticism Scores.1,Extraversion Scores.1,Neuroticism Scores.2,Extraversion Scores.2,Neuroticism Scores.3,Extraversion Scores.3,EXT2,EXT3,EXT4,EXT5,EXT6,EXT7,EXT8,EXT9,EXT10,EST1,EST2,...,AGR1_E,AGR2_E,AGR3_E,AGR4_E,AGR5_E,AGR6_E,AGR7_E,AGR8_E,AGR9_E,AGR10_E,CSN1_E,CSN2_E,CSN3_E,CSN4_E,CSN5_E,CSN6_E,CSN7_E,CSN8_E,CSN9_E,CSN10_E,OPN1_E,OPN2_E,OPN3_E,OPN4_E,OPN5_E,OPN6_E,OPN7_E,OPN8_E,OPN9_E,OPN10_E,dateload,screenw,screenh,introelapse,testelapse,endelapse,IPC,country,lat_appx_lots_of_err,long_appx_lots_of_err
0,45,32,39,24,46,26,32,39,24,46,24,32,39,24,46,32,39,24,46,39,24,46,4.0,24,46,24,46,30,46,1.0,5.0,2.0,5.0,1.0,5.0,2.0,4.0,1.0,1.0,4.0,...,4750.0,5475.0,11641.0,3115.0,3207.0,3260.0,10235.0,5897.0,1758.0,3081.0,6602.0,5457.0,1569.0,2129.0,3762.0,4420.0,9382.0,5286.0,4983.0,6339.0,3146.0,4067.0,2959.0,3411.0,2170.0,4920.0,4436.0,3116.0,2992.0,4354.0,2016-03-03 02:01:01,768.0,1024.0,9.0,234.0,6,1,GB,51.5448,0.1991
1,35,37,44,25,20,29,37,44,25,20,21,37,40,21,20,37,40,21,20,40,21,20,3.0,21,20,21,20,37,20,5.0,3.0,4.0,3.0,3.0,2.0,5.0,1.0,5.0,2.0,3.0,...,2158.0,2090.0,2143.0,2807.0,3422.0,5324.0,4494.0,3627.0,1850.0,1747.0,5163.0,5240.0,7208.0,2783.0,4103.0,3431.0,3347.0,2399.0,3360.0,5595.0,2624.0,4985.0,1684.0,3026.0,4742.0,3336.0,2718.0,3374.0,3096.0,3019.0,2016-03-03 02:01:20,1360.0,768.0,12.0,179.0,11,1,MY,3.1698,101.706
2,41,34,42,26,25,28,34,42,26,25,26,34,40,26,25,34,40,26,25,40,26,25,2.0,26,25,26,25,30,25,3.0,4.0,4.0,3.0,2.0,1.0,3.0,2.0,5.0,4.0,4.0,...,1089.0,2203.0,3386.0,1464.0,2562.0,1493.0,3067.0,13719.0,3892.0,4100.0,4286.0,4775.0,2713.0,2813.0,4237.0,6308.0,2690.0,1516.0,2379.0,2983.0,1930.0,1470.0,1644.0,1683.0,2229.0,8114.0,2043.0,6295.0,1585.0,2529.0,2016-03-03 02:01:56,1366.0,768.0,3.0,186.0,7,1,GB,54.9119,-1.3833
3,39,25,38,29,26,31,25,38,29,26,27,25,38,27,26,25,38,27,26,38,27,26,2.0,27,26,27,26,33,26,2.0,2.0,3.0,4.0,2.0,2.0,4.0,1.0,4.0,3.0,3.0,...,6062.0,11952.0,1040.0,2264.0,3664.0,3049.0,4912.0,7545.0,4632.0,6896.0,2824.0,520.0,2368.0,3225.0,2848.0,6264.0,3760.0,10472.0,3192.0,7704.0,3456.0,6665.0,1977.0,3728.0,4128.0,3776.0,2984.0,4192.0,3480.0,3257.0,2016-03-03 02:02:02,1920.0,1200.0,186.0,219.0,7,1,GB,51.75,-1.25
4,48,48,46,19,29,23,48,46,19,29,23,48,42,23,29,48,42,23,29,42,23,29,3.0,23,29,23,29,29,29,3.0,3.0,3.0,5.0,3.0,3.0,5.0,3.0,4.0,1.0,5.0,...,6771.0,2819.0,3682.0,2511.0,16204.0,1736.0,28983.0,1612.0,2437.0,4532.0,3843.0,7019.0,3102.0,3153.0,2869.0,6550.0,1811.0,3682.0,21500.0,20587.0,8458.0,3510.0,17042.0,7029.0,2327.0,5835.0,6846.0,5320.0,11401.0,8642.0,2016-03-03 02:02:57,1366.0,768.0,8.0,315.0,17,2,KE,1.0,38.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5071,37,40,37,37,20,35,40,37,37,20,33,40,35,33,20,40,35,33,20,35,33,20,1.0,33,20,33,20,37,20,5.0,5.0,4.0,3.0,3.0,2.0,5.0,1.0,5.0,5.0,3.0,...,2461.0,2134.0,2477.0,1751.0,2869.0,2850.0,2568.0,3565.0,1884.0,5105.0,1937.0,4724.0,1824.0,4000.0,2320.0,2792.0,1389.0,1838.0,1599.0,2611.0,2133.0,3566.0,817.0,3380.0,1297.0,2451.0,5867.0,2199.0,1273.0,1328.0,2016-03-06 20:45:58,1366.0,768.0,9.0,139.0,14,1,US,38.2552,-85.5459
5072,44,32,43,40,13,40,32,43,40,13,40,32,39,40,13,32,39,40,13,39,40,13,1.0,40,13,40,13,34,13,4.0,2.0,5.0,1.0,4.0,1.0,5.0,1.0,5.0,5.0,4.0,...,6492.0,4091.0,2484.0,1789.0,8119.0,2405.0,4029.0,3659.0,2194.0,1755.0,0.0,8538.0,3731.0,3301.0,3936.0,5596.0,2665.0,14577.0,3209.0,2394.0,6152.0,11628.0,13926.0,3267.0,3984.0,2372.0,6900.0,3464.0,38169.0,5166.0,2016-03-06 20:47:15,1920.0,1080.0,16.0,274.0,12,1,US,42.9773,-87.8941
5073,43,44,41,29,41,29,44,41,29,41,29,44,37,29,41,44,37,29,41,37,29,41,4.0,29,41,29,41,35,41,1.0,4.0,2.0,5.0,1.0,4.0,4.0,4.0,2.0,4.0,2.0,...,4583.0,6525.0,2901.0,3218.0,6384.0,2889.0,4009.0,2999.0,4558.0,2993.0,3508.0,4884.0,3149.0,3682.0,2450.0,5818.0,2233.0,2443.0,2234.0,3409.0,3184.0,4881.0,2140.0,3035.0,2441.0,4225.0,3111.0,3091.0,3916.0,2736.0,2016-03-06 20:48:20,768.0,1024.0,6.0,185.0,10,1,US,27.0775,-80.2587
5074,26,34,33,27,39,31,34,33,27,39,25,34,31,25,39,34,31,25,39,31,25,39,3.0,25,39,25,39,37,39,1.0,5.0,4.0,5.0,3.0,5.0,3.0,3.0,1.0,2.0,2.0,...,3835.0,4268.0,3403.0,2998.0,5500.0,3700.0,4150.0,6252.0,3731.0,4317.0,5718.0,4130.0,2815.0,3282.0,5766.0,4330.0,3254.0,2666.0,1971.0,3101.0,4534.0,4083.0,3002.0,3802.0,2684.0,3683.0,4466.0,11121.0,3815.0,2681.0,2016-03-06 20:49:07,1024.0,600.0,36.0,323.0,12,1,MY,3.1698,101.7026


In [284]:
df_fraction['OPN10']

0       5.0
1       3.0
2       4.0
3       3.0
4       5.0
       ... 
5071    3.0
5072    4.0
5073    5.0
5074    3.0
5075    4.0
Name: OPN10, Length: 5076, dtype: float64

## 4.2. Modelling assumptions <a class="anchor" id="ModellingAssumptions"></a>
Many modelling techniques make specific assumptions about the data, for example that all attributes have uniform distributions, that no values are missing, or that the class attribute is symbolic. Record any assumptions made.

- 
- 
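
One assumption the trait scoring relies on is that every answer lies between 1 and 5 with nothing missing. A minimal sketch of how that could be checked before modelling is shown here; the function name `check_answers` and the `demo` frame are hypothetical, and the column naming (`EXT1` to `OPN10`) is taken from the cells above.

```python
import pandas as pd

TRAITS = ("EXT", "EST", "AGR", "CSN", "OPN")

def check_answers(frame: pd.DataFrame) -> dict:
    """Count missing answers and answers outside 1..5 in the question columns present."""
    wanted = [f"{trait}{i}" for trait in TRAITS for i in range(1, 11)]
    cols = [c for c in wanted if c in frame.columns]
    answers = frame[cols]
    missing = answers.isna().sum().sum()
    # Missing values are counted separately, so exclude them from the range check.
    out_of_range = (~answers.isin([1, 2, 3, 4, 5]) & answers.notna()).sum().sum()
    return {"missing": int(missing), "out_of_range": int(out_of_range)}

# Made-up frame with one missing answer and two invalid ones (0.0 and 6.0).
demo = pd.DataFrame({"EXT1": [1.0, 0.0, None], "EXT2": [5.0, 3.0, 6.0]})
result = check_answers(demo)
print(result)
```

If either count is non-zero on the real data, the affected rows would need to be dropped or imputed, and that choice recorded as an assumption in this section.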


## 5.3. Build Model <a class="anchor" id="BuildModel"></a>
Run the modelling tool on the prepared dataset to create one or more models.

**Parameter settings** - With any modelling tool there are often a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice of parameter settings.

**Models** - These are the actual models produced by the modelling tool, not a report on the models.

**Model descriptions** - Describe the resulting models, report on the interpretation of the models and document any difficulties encountered with their meanings.

## 6. Results/Data/Findings <a class="anchor" id="Results"></a>
- Of the 50 questions, the one most frequently answered 5.0 is OPN9 (456735 respondents) and the one least frequently answered 5.0 is OPN4 (33235 respondents).
- OPN9 is 'I spend time reflecting on things' and OPN4 is 'I am not interested in abstract ideas'.
- A high answer on OPN9 and a low answer on OPN4 both raise the Openness score, because OPN4 is reverse-scored.

- Use the format that will achieve this most effectively: e.g. text, graphs, tables or diagrams. 
- Don’t repeat the same information in two visual formats (e.g. a graph and a table).  
- Give each figure a title and describe in words what the figure demonstrates. 
- Save your interpretation of the results for the Discussion section. 
- In most data mining projects a single technique is applied more than once and data mining results are generated with several different parameters 


# 6. Discussion <a class="anchor" id="Discussion"></a>	

- This is probably the longest writing section. 
- It brings everything together, showing how your findings respond to the brief you explained in your introduction and the previous research you surveyed in your literature review. 
- This is the place to mention if there were any problems (e.g. your results were different from expectations, you couldn’t find important data, or you had to change your method or participants) and how they were, or could have been, solved.
- Interpret the models according to your domain knowledge, your data mining success criteria and your desired test design. 
- Judge the success of the application of modelling and discovery techniques technically, then in the business context. 




# 7. Conclusion <a class="anchor" id="Conclusion"></a>
- Should be a short section with no new arguments or evidence. 
- Sum up the main points of your research. How do they answer the original brief for the work reported on? This section may also include: 
    - Recommendations for action 
    - Suggestions for further research 

# 8. Reference List/Bibliography <a class="anchor" id="Reference"></a>

- List full details for any works you have referred to in the report. 
- For the correct style of referencing to use, check college guidelines.  
- If you are uncertain about how or when to reference, see the college library referencing guide.
