# Final Project: 9. Analytical Thinking

## 1) Choosing a data set (30 mins)
For the upcoming project, you need to choose a data set. You can either choose your own data set or use one of the three linked below.

1.   **Data Set 1:** Stack Overflow survey for Germany 2022

You may analyze the Annual Stack Overflow Developer Survey from 2022. It contains about 70,000 responses from over 180 countries and provides an overview of the global developer community using Stack Overflow. The random sample contains 20% of the original data set.

- the random sample is available [here](https://raw.githubusercontent.com/ReDI-School/nrw-data-analytics/main/9_StackOverflowSurvey_Sample_random20percent.csv) - *recommended*
- you can also download the full dataset [here](https://insights.stackoverflow.com/survey) - *very big*
- get insights [here](https://survey.stackoverflow.co/2022/)

-----------------

2.   **Data Set 2:** Xinjiang Detention Centers

You may analyze data on prisoners of the Xinjiang region in China. Since the spring of 2017, China's Xinjiang Uyghur Autonomous Region has seen a drastic rise in the mass incarceration of its ethnic minority citizens – most notably the Uyghur, Kazakh, Kyrgyz, and Hui – with hundreds of thousands being locked up in detention centers.

- you can download different data sets (deaths, camps...) [here](https://shahit.biz/eng/#lists) or [here](https://xjdp.aspi.org.au/resources/xinjiangs-detention-facilities/)


-----------------

3.  **Data Set 3:** KiGGS longitudinal study of children's health in Germany

The KiGGS is a long-term study conducted by the Robert Koch Institute to monitor the health of children, adolescents and young adults in Germany. To this end, a multitude of factors that could influence a child's health and health outcomes are tracked.

#### Note: the data is in German! A description is also available in English.

- download data [here](https://www.icloud.com/attachment/?u=https%3A%2F%2Fcvws.icloud-content.com%2FB%2FAQZWEVZl6Rbqfzi2cvOFkNqiOz1NATmff9gA-o9tC39z6AoIkxNODRFZ%2F%24%7Bf%7D%3Fo%3DAgtYIYANQR1jKX_JQUt9xBWGXn-9TqE-mjBt0R32QpSN%26v%3D1%26x%3D3%26a%3DCAogDFZ_030BtACfjKNn2Th44b0oMR2AiBJ8KAl72X-uoZYSdhDF_82Y0DAYxY_J7NkwIgEAKgkC6AMA_z4Bat1SBKI7PU1aBE4NEVlqJTXTfMuQze9vwIiTC6bX6fyMEHr8i3K8X2Z8LcK3p75FoexHaCFyJTXW-HyjsyJYt6eTnqyJX6DkOtaed9LRWB-TPtfiK17ZcN7iFTk%26e%3D1673385887%26fl%3D%26r%3DA5B367E6-5082-4387-BE82-F5C4065778DB-1%26k%3D%24%7Buk%7D%26ckc%3Dcom.apple.largeattachment%26ckz%3DD0CB4BD2-FD03-4686-9C71-782D43341A29%26p%3D34%26s%3DR8AMO5k1O0jqfOJo3YEJgGkYi8A&uk=3abPxkbaXmrY_zmj3G25ag&f=KiGGS03_06.sav&sz=26878994)
- get insights [here](https://www.kiggs-studie.de/english/survey.html)



## 2) Analysis (120 mins)

Step 1: **Set a Goal!** -> Think in terms of a statistical plan. Look through the questions below; they are examples that may serve as a starting point for your analysis.

**Stack Overflow Survey:**
1.   What kind of information does the survey contain?
2.   Which variables are particularly interesting to you and why? 
3. Analyze salary and how different variables correlate with it.

**Prisoners & Detention Centers**
1. Do the data sets reflect the same underlying processes, i.e. do the missing persons contribute to the population of local prisons?
2. Visualize the data over time.
3. Calculate and/or visualize the data according to location (difficult).

**Children's Health**
1. Investigate blood levels of HbA1c, a biomarker of blood sugar levels and therefore an indicator of diabetes.
2. Variables that might be associated with it are bmiB (body mass index), sex, age2 (age group) and the frequency and amount of eating chips (fq44 / fq44a).


----------------------
General questions:
1. Descriptive statistics & graphs
* What are the scales (nominal, ordinal, metric) of the variables that you want to analyze?
* for each variable decide which descriptive statistic is best suited to describe it: (i) frequencies, (ii) mean/ standard deviation, (iii) deviation
* indicate how many missing values each variable you analyze has
*...
2. Analysis 
* for example use linear regression or logistic regression
* formulate a goal and hypothesis and test those
*...
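
The descriptive questions above could be approached roughly as follows. This is a minimal sketch, not a required solution: the five rows are typed in by hand from the `kiggsdata.head()` output shown in section 1b, and the category order used for `fq44` is an assumption (only three of its levels appear in those rows), so check the KiGGS documentation for the full scale.

```python
import numpy as np
import pandas as pd

# First five rows of the KiGGS sample, typed in by hand for illustration
kiggs_head = pd.DataFrame({
    "HbA1c": [4.2, 5.4, 4.4, 3.6, np.nan],
    "bmiB": [21.929268, 16.546827, 22.564967, 20.714774, 15.816788],
    "sex": ["Männlich", "Männlich", "Weiblich", "Weiblich", "Männlich"],
    "fq44": ["1 mal im Monat", "2-3 mal im Monat", "2-3 mal im Monat",
             "1-2 mal pro Woche", "2-3 mal im Monat"],
})

# Nominal scale: frequencies
sex_counts = kiggs_head["sex"].value_counts()

# Ordinal scale: frequencies in a meaningful order (ASSUMED, incomplete order)
fq44_order = ["1 mal im Monat", "2-3 mal im Monat", "1-2 mal pro Woche"]
fq44 = pd.Categorical(kiggs_head["fq44"], categories=fq44_order, ordered=True)
fq44_counts = pd.Series(fq44).value_counts(sort=False)

# Metric scale: mean and standard deviation (NaN is skipped by default)
hba1c_mean = kiggs_head["HbA1c"].mean()
hba1c_std = kiggs_head["HbA1c"].std()

# Missing values per variable
missing = kiggs_head.isna().sum()

print(sex_counts, fq44_counts, missing, sep="\n\n")
print(f"HbA1c: mean={hba1c_mean:.2f}, sd={hba1c_std:.2f}")
```

With the full data set, the same calls would be applied to `kiggsdata` instead of the hand-typed frame.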

---------------

### **1) Import data**

**a) Import from local storage after downloading csv from website**

In [25]:
import pandas as pd
import psycopg2 as ps

In [6]:
# Example: Full Stack Overflow Survey
from google.colab import files
uploaded = files.upload()

Saving 9_survey_results_public.csv to 9_survey_results_public.csv


In [8]:
import pandas as pd
import io
 
df = pd.read_csv(io.BytesIO(uploaded['9_survey_results_public.csv']))
len(df)

73268

**b) Import from GitHub**

In [2]:
# Random sample of the survey data
url = 'https://raw.githubusercontent.com/ReDI-School/nrw-data-analytics/main/9_StackOverflowSurvey_Sample_random20percent.csv'
surveydata = pd.read_csv(url)
print(surveydata.head())

   ResponseId                      MainBranch  \
0        8553  I am a developer by profession   
1       70297  I am a developer by profession   
2       37158  I am a developer by profession   
3       52069     I code primarily as a hobby   
4        9003  I am a developer by profession   

                                          Employment  \
0                                Employed, full-time   
1                                Employed, full-time   
2                                Employed, full-time   
3  Student, full-time;Not employed, but looking f...   
4  Employed, full-time;Independent contractor, fr...   

                             RemoteWork                CodingActivities  \
0                        Full in-person                           Hobby   
1  Hybrid (some remote, some in-person)                           Hobby   
2                          Fully remote  Hobby;Bootstrapping a business   
3                                   NaN                             
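
The output above shows that columns such as `Employment` and `CodingActivities` pack several answers into one cell, separated by `;`. One possible way to count each answer on its own is to split the cells and give every answer its own row. The sketch below uses three values copied from the `CodingActivities` column above; the variable names are only illustrative.

```python
import pandas as pd

# Values copied from the CodingActivities column in the output above
activities = pd.Series(
    ["Hobby", "Hobby", "Hobby;Bootstrapping a business"],
    name="CodingActivities",
)

# Split each cell on ';' into a list, then give every answer its own row
exploded = activities.str.split(";").explode()

# Frequency of each individual answer
activity_counts = exploded.value_counts()
print(activity_counts)
```

On the real sample this would presumably start from `surveydata["CodingActivities"]`; missing cells stay `NaN` after the split and are left out by `value_counts()`.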

In [3]:
# KiGGS data set
url2 = 'https://raw.githubusercontent.com/ReDI-School/nrw-data-analytics/main/9_KiGGS03_06.csv'
kiggsdata = pd.read_csv(url2)
print(kiggsdata.head())

   HbA1c       bmiB       sex        age2               fq44  \
0    4.2  21.929268  Männlich  12 - 13 J.     1 mal im Monat   
1    5.4  16.546827  Männlich  10 - 11 J.   2-3 mal im Monat   
2    4.4  22.564967  Weiblich    8 - 9 J.   2-3 mal im Monat   
3    3.6  20.714774  Weiblich  14 - 15 J.  1-2 mal pro Woche   
4    NaN  15.816788  Männlich    4 - 5 J.   2-3 mal im Monat   

                       fq44a  
0  1/4 Schale (oder weniger)  
1                  2 Schalen  
2                   1 Schale  
3                   1 Schale  
4                  2 Schalen  


**c) Mounting Google Drive** 

In [20]:
import pandas as pd
import seaborn as sns

In [21]:
# codegrepper.com/code-examples/go/access+drive+from+colab
from google.colab import drive

In [22]:
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [23]:
!ls '/content/gdrive/MyDrive/ReDi 2022'

 3_Pandas_Project.ipynb
 3_Pandas_Project_Solutions.ipynb
 9_Stackoverflow_Analytical_Thinking_Exercise_version_2021_Dec_13.ipynb
 9_Updated_Project_Intro.ipynb
'Copy of 1. Kick-Off Lecture.ipynb'
'Copy of 1_Kick_Off_Project.ipynb'
'Copy of 2. Statistics.ipynb'
'Copy of 2_Statistics_Project.ipynb'
'Copy of 3.1 Lecture Intro to Pandas [Filter & Groupby].ipynb'
'Copy of 3_1 Project Intro to Pandas [Filters & Groupby].ipynb'
'Copy of 3.2 Pandas Transformations Lecture.ipynb'
'Copy of 3.2 Pandas Transformations Project.ipynb'
'Copy of 4. Data Structures Project.ipynb'
'Copy of 7 Storytelling & dashboards - Lecture.ipynb'
'Copy of Berlin_flats.ipynb'
'Copy of Lecture — Introduction to analytical thinking.ipynb'
'Copy of Project — Intro to Analytical Thinking .ipynb'
 Lecture_9_EDA_Summary_Stackoverflow_Analytical_Thinking.ipynb
 Untitled
 xjvictims_campfac.csv
 xjvictims_deaths.csv


In [28]:
filepath = r"/content/gdrive/MyDrive/ReDi 2022/xjvictims_campfac.csv"
xjvictims_campfac = pd.read_csv(filepath, sep=';', header=None)
xjvictims_campfac

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,27,28,29,30,31,32,33,34,35,36
0,Entry,Victim's Name,Chinese name,ID no.,About the Testifier,Relation,About the Victim,Assumed Location,When Detention Took Place,Detention Reason,...,Status,When problems started,Detention reason,Official detention reason (1),Official detention reason (2),Official detention reason (3),Health status,Profession,Locality (latitude),Locality (longitude)
1,11,Rozigul Abdureshit,茹则古丽·阿卜杜热西提,65312819800118??E?,"Testimony 1*|2|3|4|5: Memtiminjan Memet, origi...",,رىزۋانگۈل ئابدۇرىشىت 2016-يىلى 4-ئايدا تۈركىيە...,پەيزىئاۋات ناھىيە تىرىم يېزىسى لاگىردا,2017/4<br /><br />[Testimonies 2-5: She arrive...,تۈركىيەگە كەلگەنلىك,...,no news for over a year,Apr. 2018 - June 2018,related to going abroad,---,---,---,has problems,---,,
2,25,Irpan Hezim,,654221199???????O?,"Testimony 1: Local police, as reported by Radi...",,"Erfan Hezim, 19. Footballer in the Chinese Sup...","Presumably in inner China, as he's back to pla...",Testimony 1: Detained by Dorbiljin Market [lik...,"Testimony 1: Reportedly, for ""visiting foreign...",...,other,Jan. 2018 - Mar. 2018,---,related to going abroad,---,---,---,athlete,,
3,116,Erkin Qami,叶尔肯·哈密,652501195805052115,"Testimony 1|2|3|4|5|6: Munira Erkin, born in 1...",,Erkin Qami (叶尔肯*哈密) is a Chinese citizen who o...,Unclear if he is still in Xinjiang or if he ha...,"earlier: Arrested on December 25, 2017.<br /><...",---,...,---,Oct. 2017 - Dec. 2017,---,---,---,---,has problems,---,46.747.937,82.984.721
4,120,Erzhan Qurban,叶尔江·库尔帮,654121197807014272,"Testimony 1|2|4|5|6|8: Mainur Medetbek, now a ...",,Erzhan Qurbanuly (叶尔江*库尔帮) took his daughter t...,Back in Kazakhstan.,Testimony 2: before the Chinese Spring Festiva...,unclear,...,free,Oct. 2017 - Dec. 2017,---,---,---,---,has problems,"farmwork, herding",43.894.823,81.954.046
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
901,47386,Idris Qunduz,伊德日斯·坤杜孜,65312219530405341X,"Testimony 1|3|4: Local police records, as repo...",,Idris Qunduz.<br /><br />Registration address:...,Testimony 2: died in a prison in Urumqi.<br />...,"Taken into custody on July 27, 2017.<br /><br ...",Provided reasons for being taken into custody:...,...,sentenced,July 2017 - Sep. 2017,---,disturbing public order,related to religion,extremism,deceased,"farmwork, herding",3.914.114,76.179.189
902,48352,Sajidehan Shukur,萨吉达罕·许库尔,653121194605122625,"Local police records, as reported by Anonymous...",,Sajidehan Shukur was a farmer.<br /><br />Regi...,[Presumably in Kashgar.],"Sent to ""transformation through education"" at ...",Sent to camp for being a buwi (female religiou...,...,unclear (soft),---,---,related to religion,---,---,critical,"farmwork, herding",39.489.424,75.632.155
903,48389,Qasim Memet,喀斯木·麦麦提,653121197201012618,"Local police records, as reported by Anonymous...",,Qasim Memet was a cadre of the township govern...,New Campus Camp in Konasheher County (疏附县新校区).,"Sent to ""transformation through education"" at ...",---,...,concentration camp,---,---,two-faced,---,---,---,government,39.463.524,75.611.879
904,48409,Ayzimhan Tash,阿依孜米汗·塔西,652926197706121128,"Testimony 1|2: Urumqi police records, as repor...",,Ayzimhan Tash.<br /><br />Testimony 2:<br /><b...,---,"As of 2017-2018, the victim was tagged in the ...",---,...,unclear (soft),---,---,---,---,---,---,---,437.703,87.612.395
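
In the output above the columns are labelled 0 to 36 and the real column names ("Entry", "Victim's Name", ...) sit in row 0, which is what `header=None` produces. A minimal sketch of reading the first line as the header instead, using three columns and three rows copied from that output. Treating `---` as a missing value is an assumption about what the placeholder means.

```python
import io
import pandas as pd

# Three columns and three rows copied from the output above (';'-separated)
raw = io.StringIO(
    "Entry;Status;Profession\n"
    "11;no news for over a year;---\n"
    "120;free;farmwork, herding\n"
    "48389;concentration camp;government\n"
)

# header=0 uses the first line as column names instead of data;
# na_values treats the '---' placeholder as missing (an ASSUMPTION)
campfac = pd.read_csv(raw, sep=";", header=0, na_values=["---"])

print(campfac.columns.tolist())
print(campfac.dtypes)
```

For the file on Google Drive, the same arguments would go into the `pd.read_csv(filepath, ...)` call. Note that the latitude and longitude values in the output (for example `46.747.937`) contain two dots, so they will not parse as numbers without further cleaning.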


### Data Exploration


* See https://towardsdatascience.com/an-introduction-to-exploratory-data-analysis-in-python-9a76f04628b8
* See Lecture 6 - Exploratory Data Analysis [here](https://github.com/ReDI-School/nrw-data-analytics/blob/main/6_Lecture_More_Plots_and_intro_to_EDA_edited.ipynb)
* Read the documentation of the data set, if there is any!
* If necessary, get rid of NAs


In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.isna().sum()

In [None]:
df.count()

In [None]:
df.columns
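
Getting rid of NAs does not have to mean dropping every incomplete row. A possible approach is to drop only the rows that lack a value in the variables you actually analyze. The sketch below uses a toy frame (`df_toy` is a hypothetical name) whose values are copied from the KiGGS output in section 1b.

```python
import numpy as np
import pandas as pd

# Toy frame; HbA1c and bmiB values are copied from the KiGGS output above
df_toy = pd.DataFrame({
    "HbA1c": [4.2, 5.4, 4.4, 3.6, np.nan],
    "bmiB": [21.929268, 16.546827, 22.564967, 20.714774, 15.816788],
})

# Share of missing values per column, in percent
missing_pct = df_toy.isna().mean() * 100

# Drop only the rows that lack a value in the columns needed for the analysis
analysis_cols = ["HbA1c", "bmiB"]
df_complete = df_toy.dropna(subset=analysis_cols)

print(missing_pct.round(1))
print(f"{len(df_toy) - len(df_complete)} of {len(df_toy)} rows dropped")
```

Reporting how many rows were dropped, as in the last line, helps readers judge how much the result depends on the missing data.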

### Analyze

*“The basic general intent of data analysis is simply stated: to seek through a body of data for interesting relationships and information and to exhibit the results in such a way as to make them recognizable to the data analyzer and recordable for posterity”* - J. W. Tukey

* formulate a goal for your analysis
* generate hypotheses and test them in order to reach that goal
* create visualizations
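
As one illustration of going from a goal to a tested hypothesis, the sketch below fits a simple linear regression and checks the association with a permutation test. The data are SIMULATED purely for illustration and say nothing about the real KiGGS results. The permutation test is a technique chosen here because it needs only NumPy and pandas; it is not necessarily the test intended by the course.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# SIMULATED data for illustration only; not KiGGS values or results
n = 200
bmi = rng.normal(loc=19.0, scale=3.0, size=n)
hba1c = 4.0 + 0.05 * bmi + rng.normal(scale=0.4, size=n)
sim = pd.DataFrame({"bmiB": bmi, "HbA1c": hba1c})

# Least-squares line: HbA1c = intercept + slope * bmiB
slope, intercept = np.polyfit(sim["bmiB"], sim["HbA1c"], deg=1)

# Permutation test of H0: no linear association between bmiB and HbA1c.
# Shuffling HbA1c breaks any link to bmiB; count how often the shuffled
# correlation is at least as extreme as the observed one.
observed_r = sim["bmiB"].corr(sim["HbA1c"])
n_perm = 2000
perm_r = np.array([
    np.corrcoef(sim["bmiB"], rng.permutation(sim["HbA1c"].to_numpy()))[0, 1]
    for _ in range(n_perm)
])
p_value = (np.sum(np.abs(perm_r) >= abs(observed_r)) + 1) / (n_perm + 1)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"r={observed_r:.3f}, permutation p-value={p_value:.4f}")
```

A small p-value would speak against the hypothesis of no association. With real data, remember to handle missing values first and to state the hypothesis before looking at the result.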

In [None]:
....

# Resources
- (1) https://app.gitbook.com/@redi-school-1/s/data-analytics/sql-and-databases
- (2) https://app.gitbook.com/@redi-school-1/s/data-analytics/analytical-thinking
- (3) https://www.stat.berkeley.edu/users/brill/Stat215b/oct7.pdf
- (4) https://www.youtube.com/watch?v=GSk-EEu1zkA (1/2) and https://www.youtube.com/watch?v=i5E2hruuLaQ (2/2)
- (5) https://www.youtube.com/watch?v=N00g9Q9stBo&t=3307s
- (6) https://www.youtube.com/watch?v=vc1bq0qIKoA


<font color='green'>**Next**
- Project
- Splunk, PowerBI or Tableau?
- Fisher's exact test, ...
- supervised vs unsupervised (ML)
</font>