# Data Journalism in Python:
# A Practical Introduction to Programming


### Natalie Widmann




Winter semester 2022/2023


Universität Leipzig





## Organizational Matters

### Project

- Project presentation and end of course: January 26, 2023
- Project submission: February 16, 2023



![Timeline](../imgs/timeline.png)

# Looking Back: Part I - Python Basics


- read, understand, and write code
- run Python programs and analyze errors
- process and analyze simple data (lists or dictionaries)
- manipulate strings and extract information
- perform calculations and make comparisons

- Application examples from everyday journalism:
    - Jeff Bezos calculator
    - Automated texts
    - Air quality in Leipzig
    - Outside income of local politicians in Rhineland-Palatinate



![Datenpipeline](../imgs/datapipeline.png)


## Part II - Data Analysis in Python


### Goals

- understand how data is processed
- process and analyze structured data
- visualize data
- use Python modules
- read and save different data formats (csv, json, excel, txt)

# What Is Data?


### Structured data: advantages and examples

Structured data is data that is always available in the same structure and the same format. In addition, this format is known or documented. The advantage of such data is that algorithms or simple instructions can process it quickly.

Examples of structured data:

- rows with an identical layout in an Excel file (e.g. orders)
- data from relational databases

### Unstructured data: challenges and examples

In contrast to structured data, unstructured data is not available in a form that simple algorithms or instructions can process. Such data is very common in everyday life: every e-mail and every website essentially consists of unstructured data. Until now, extracting the essential information from the text has been a task for humans. A fixed, predefined algorithm usually cannot cope with the many ways in which the data may be presented.

Examples of unstructured data:

- content of an e-mail
- PowerPoint presentations
- website content
- text files
- videos

### Semi-structured data: a hybrid

When structured and unstructured data occur together, the result is called semi-structured data. For example, a database may contain long text fields holding arbitrary, undefined content. The record itself is structured, while the value of the text field is unstructured.

Example of semi-structured data:

- E-mail: recipient, subject line, and sender have a structure, while the message text itself is unstructured. Taken as a whole, the e-mail is therefore semi-structured.

(Source: https://www.status-kwo.at/blog/strukturierte-vs-unstrukturierte-daten/)
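The e-mail example can be sketched in a few lines of Python. This is only an illustration: the e-mail, its field names, and the measurement in the text are invented. The header fields can be read directly by name, while the value hidden in the free text has to be searched for with a pattern.

```python
import re

# Hypothetical e-mail, invented for illustration
email = {
    "sender": "editor@example.org",
    "recipient": "reporter@example.org",
    "subject": "Air quality data",
    "body": "Hi, the new measurements arrived. "
            "The station reported 42 µg/m³ on Monday.",
}

# Structured part: every field has a fixed, known name
print(email["sender"])

# Unstructured part: the number has to be found inside free text
match = re.search(r"(\d+)\s*µg/m³", email["body"])
value = int(match.group(1)) if match else None
print(value)
```

The regular expression only works for this one phrasing. A differently worded e-mail would need a different pattern, which is the difficulty with unstructured data.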

# Is Data Objective?



## NO.

Learning about data journalism begins with understanding how to think critically about information and how it can be collected, normalized and analyzed for journalistic purposes. It begins with figuring out the story, and asking the questions that get you there.
And journalism educators likely already know the form those questions can take:


- Who created the data?
- What is the data supposed to include?
- When was the data last updated?
- What part of the world does the data represent?
- Why do we need this data to tell our story?
- How do we find the answers to the questions we want to ask of this data?


## Python Packages



In [3]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pandas

Collecting pandas
  Using cached pandas-1.5.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting pytz>=2020.1
  Using cached pytz-2022.6-py2.py3-none-any.whl (498 kB)
Collecting numpy>=1.20.3; python_version < "3.10"
  Downloading numpy-1.23.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: pytz, numpy, pandas
Successfully installed numpy-1.23.4 pandas-1.5.1 pytz-2022.6
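After the installation, a quick check confirms that the package can be imported in the current kernel. A minimal sketch; the version printed depends on the environment and will not necessarily match the log above.

```python
import pandas as pd

# Prints the installed pandas version (environment-dependent)
print(pd.__version__)
```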


## Pandas

https://pandas.pydata.org/

Pandas is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in Pandas are implemented with Series and DataFrame classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as a dictionary of Series instances. DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
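The relationship between `Series` and `DataFrame` can be tried out on a tiny table. The column names and values below are invented for illustration and are not taken from the course dataset.

```python
import pandas as pd

# A Series: one-dimensional, indexed, with a single data type
ratings = pd.Series([4.6, 4.4, 4.9], name="rating")
print(ratings.dtype)

# A DataFrame: a table built from a dictionary of columns
courses = pd.DataFrame({
    "course": ["Drawing", "Design Thinking", "Photoshop"],
    "rating": [4.6, 4.4, 4.9],
    "enrolled": [453470, 83499, 11978],
})
print(courses.dtypes)

# Selecting one column of a DataFrame gives back a Series
print(type(courses["rating"]))
```

Each row is one observation (a course) and each column is one feature of it, as described above.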

## Udemy Course Data

Link: https://www.kaggle.com/datasets/rexxxxxxx/udemy-courses



In [6]:
import pandas as pd

data = pd.read_csv('../data/udemy_courses_2021.csv')

In [7]:
data

Unnamed: 0,Course Name,Course URL,Categories,Short Description,Long Description,Difficulty,Duration,Free Option,Rating,Original rating,Numberofrated,Numberofenroll,Paid Option,Language,Subtitle Language,Platform,Provider,Image URL
0,The Ultimate Drawing Course - Beginner to Adva...,https://www.udemy.com/course/the-ultimate-draw...,"Graphic Design & Illustration,Drawing",Learn the #1 most important building block of ...,,0,1,0,1.007737e-01,4.600000,109305,453470,NT$470,English,"English,French",1,"Jaysen Batchelor, Quinton Ross",
1,Character Art School: Complete Character Drawi...,https://www.udemy.com/course/character-art-sch...,"Design,Other Design,Character Design",Learn How to Draw People and Character Designs...,,0,1,0,5.363628e-02,4.600000,58177,260798,NT$470,English,"English,French",1,Scott Harris,
2,Complete Blender Creator: Learn 3D Modelling f...,https://www.udemy.com/course/blendertutorial/,"Design,3D & Animation,Blender",Use Blender to Create Beautiful 3D models for ...,,0,1,0,4.353540e-02,4.600000,47221,238288,NT$470,English,"English,French",1,"GameDev.tv Team, Rick Davidson, Grant Abbitt",
3,Design Thinking in 3 Steps,https://www.udemy.com/course/designit-design-t...,"Design,User Experience Design,Design Thinking","Understand your audience, envision a creative ...",,0,0,0,3.278510e-02,4.400000,37177,83499,"NT$6,590",English,"English,French",1,"Designit Strategic Design, Alan Cooper",
4,User Experience Design Essentials - Adobe XD U...,https://www.udemy.com/course/ui-ux-web-design-...,"Design,User Experience Design,User Interface","Use XD to get a job in UI Design, User Interfa...",,0,1,0,3.176671e-02,4.600000,34456,136757,NT$470,English,"English,French",1,Daniel Walter Scott,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79281,Limit Sugar Sweetened Beverages to Not More Th...,https://www.udemy.com/course/limit-sugar-sweet...,"Health & Fitness,Nutrition",Recommendation #21 of 30 for Optimizing Health...,,0,0,1,0.000000e+00,4.446485,0,214,0,English,English,1,"Nicholas Cohen, MD",
79282,Limit Processed Foods to Not More Than One Ser...,https://www.udemy.com/course/limit-processed-f...,"Health & Fitness,Nutrition",Recommendation #22 of 31 for Optimizing Health...,,0,0,1,8.911820e-07,4.446485,1,202,0,English,English,1,"Nicholas Cohen, MD",
79283,how to be an expert in the word of bodybuilding,https://www.udemy.com/course/how-to-be-an-expe...,"Health & Fitness,Fitness,Health",The comprehensive guide: prepares you to be an...,,0,0,1,0.000000e+00,4.446485,0,347,0,English,English,1,Anas Idrissi,
79284,Meditation - The Art of Inner Peace and Happin...,https://www.udemy.com/course/meditation-the-ar...,"Health & Fitness,Meditation",This is Part 5 of Meditation - The Art of Inne...,,0,0,1,0.000000e+00,4.446485,0,222,0,English,English,1,Nima King,


### Overview of the Data

In [9]:
# head() returns the first 5 rows
data.head()

Unnamed: 0,Course Name,Course URL,Categories,Short Description,Long Description,Difficulty,Duration,Free Option,Rating,Original rating,Numberofrated,Numberofenroll,Paid Option,Language,Subtitle Language,Platform,Provider,Image URL
0,The Ultimate Drawing Course - Beginner to Adva...,https://www.udemy.com/course/the-ultimate-draw...,"Graphic Design & Illustration,Drawing",Learn the #1 most important building block of ...,,0,1,0,0.100774,4.6,109305,453470,NT$470,English,"English,French",1,"Jaysen Batchelor, Quinton Ross",
1,Character Art School: Complete Character Drawi...,https://www.udemy.com/course/character-art-sch...,"Design,Other Design,Character Design",Learn How to Draw People and Character Designs...,,0,1,0,0.053636,4.6,58177,260798,NT$470,English,"English,French",1,Scott Harris,
2,Complete Blender Creator: Learn 3D Modelling f...,https://www.udemy.com/course/blendertutorial/,"Design,3D & Animation,Blender",Use Blender to Create Beautiful 3D models for ...,,0,1,0,0.043535,4.6,47221,238288,NT$470,English,"English,French",1,"GameDev.tv Team, Rick Davidson, Grant Abbitt",
3,Design Thinking in 3 Steps,https://www.udemy.com/course/designit-design-t...,"Design,User Experience Design,Design Thinking","Understand your audience, envision a creative ...",,0,0,0,0.032785,4.4,37177,83499,"NT$6,590",English,"English,French",1,"Designit Strategic Design, Alan Cooper",
4,User Experience Design Essentials - Adobe XD U...,https://www.udemy.com/course/ui-ux-web-design-...,"Design,User Experience Design,User Interface","Use XD to get a job in UI Design, User Interfa...",,0,1,0,0.031767,4.6,34456,136757,NT$470,English,"English,French",1,Daniel Walter Scott,
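`head()` also accepts the number of rows to return, and `tail()` does the same for the end of the table. A small sketch on an invented ten-row DataFrame:

```python
import pandas as pd

# Toy table with the numbers 0 to 9 (invented for illustration)
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```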


How large is the dataset? How many rows and how many columns does it have?

In [12]:
data.shape

(79286, 18)

In [15]:
print(f'Number of rows: {data.shape[0]}')
print(f'Number of columns: {data.shape[1]}')

Number of rows: 79286
Number of columns: 18
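Because `shape` is an ordinary tuple, it can be unpacked into two variables, and `len()` returns the number of rows. A minimal sketch with an invented table:

```python
import pandas as pd

# Toy table with 3 rows and 2 columns (invented for illustration)
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() counts the rows
print(len(df))
```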


The column names:

In [18]:
print(data.columns)

Index(['Course Name', 'Course URL', 'Categories', 'Short Description',
       'Long Description', 'Difficulty', 'Duration', 'Free Option', 'Rating',
       'Original rating', 'Numberofrated', 'Numberofenroll', 'Paid Option',
       'Language', 'Subtitle Language', 'Platform', 'Provider', 'Image URL'],
      dtype='object')
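Column names such as `Numberofrated` are hard to read. One possible way to handle this, shown here on an invented two-column table, is `rename()`; the new names are only suggestions.

```python
import pandas as pd

# Toy table reusing two column names from the dataset, with invented values
df = pd.DataFrame({"Numberofrated": [10, 20], "Numberofenroll": [100, 200]})

# The column index can be turned into a plain Python list
print(list(df.columns))

# rename() returns a new DataFrame; the original stays unchanged
renamed = df.rename(columns={
    "Numberofrated": "n_ratings",    # hypothetical new name
    "Numberofenroll": "n_enrolled",  # hypothetical new name
})
print(list(renamed.columns))
```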


`info()` provides more information about the columns, such as the data type and the number of non-null values.

In [20]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79286 entries, 0 to 79285
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Course Name        79286 non-null  object 
 1   Course URL         79286 non-null  object 
 2   Categories         79286 non-null  object 
 3   Short Description  79277 non-null  object 
 4   Long Description   0 non-null      float64
 5   Difficulty         79286 non-null  int64  
 6   Duration           79286 non-null  int64  
 7   Free Option        79286 non-null  int64  
 8   Rating             79286 non-null  float64
 9   Original rating    79286 non-null  float64
 10  Numberofrated      79286 non-null  int64  
 11  Numberofenroll     79286 non-null  int64  
 12  Paid Option        79286 non-null  object 
 13  Language           79286 non-null  object 
 14  Subtitle Language  79286 non-null  object 
 15  Platform           79286 non-null  int64  
 16  Provider           792
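The non-null counts reported by `info()` can also be turned around to count the missing values per column. A sketch on an invented table with two gaps:

```python
import pandas as pd

# Toy table with missing values (invented for illustration)
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "description": ["short text", None, "another text"],
    "rating": [4.5, None, 3.9],
})

# Data type of each column
print(df.dtypes)

# isna() marks missing cells, sum() counts them per column
missing = df.isna().sum()
print(missing)
```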

`describe()` shows basic summary statistics for columns with a numeric data type, i.e. `int` and `float`.

The method computes:
- the number of non-missing values (`count`)
- mean
- standard deviation
- minimum and maximum
- median (the 50% quantile)
- the 0.25 and 0.75 quantiles

In [21]:
data.describe()

| | Long Description | Difficulty | Duration | Free Option | Rating | Original rating | Numberofrated | Numberofenroll | Platform | Image URL |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 0.0 | 79286.0 | 79286.0 | 79286.0 | 79286.0 | 79286.0 | 79286.0 | 79286.0 | 79286.0 | 0.0 |
| mean | NaN | 0.161819 | 0.109666 | 0.058459 | 0.000302 | 4.272726 | 336.278599 | 4650.728956 | 1.0 | NaN |
| std | NaN | 0.415623 | 0.324825 | 0.234611 | 0.002534 | 0.492569 | 2745.212193 | 17653.776206 | 0.0 | NaN |
| min | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 1.0 | NaN |
| 25% | NaN | 0.0 | 0.0 | 0.0 | 7e-06 | 4.1 | 8.0 | 75.0 | 1.0 | NaN |
| 50% | NaN | 0.0 | 0.0 | 0.0 | 2.5e-05 | 4.366739 | 30.0 | 533.0 | 1.0 | NaN |
| 75% | NaN | 0.0 | 0.0 | 0.0 | 9.4e-05 | 4.6 | 110.0 | 2985.75 | 1.0 | NaN |
| max | NaN | 2.0 | 2.0 | 1.0 | 0.233447 | 5.0 | 253210.0 | 997885.0 | 1.0 | NaN |
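How `describe()` treats missing values is easiest to see on a very small column. In this sketch (invented values) one of four ratings is missing, so `count` is 3 and the remaining statistics are computed from the three values that are present.

```python
import pandas as pd

# Four ratings, one of them missing (invented for illustration)
df = pd.DataFrame({"rating": [4.0, 5.0, None, 3.0]})

stats = df["rating"].describe()
print(stats)

# Individual statistics can be accessed by their label
print(stats["count"], stats["mean"], stats["50%"])
```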


`.unique()` returns the distinct values of a column.

In [27]:
data['Difficulty'].unique()

array([0, 1, 2])

In [28]:
data['Categories'].unique()

array(['Graphic Design & Illustration,Drawing',
       'Design,Other Design,Character Design',
       'Design,3D & Animation,Blender', ...,
       'Health & Fitness,Other Health & Fitness,Occupational Therapy',
       'Health & Fitness,Sports,Mental Health',
       'Health & Fitness,General Health,Medical Device Development'],
      dtype=object)
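When only the number of distinct values matters, `nunique()` is a shorter alternative to counting the result of `unique()`. A sketch with invented difficulty levels:

```python
import pandas as pd

# Toy column with repeated values (invented for illustration)
df = pd.DataFrame({"difficulty": [0, 1, 0, 2, 1, 0]})

# Distinct values, in order of first appearance
levels = df["difficulty"].unique()
print(levels)

# Number of distinct values
n_levels = df["difficulty"].nunique()
print(n_levels)
```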

`.value_counts()` shows how often each distinct value occurs in a column.

In [24]:
data['Language'].value_counts()

English    79286
Name: Language, dtype: int64

In [23]:
data['Categories'].value_counts()

Office Productivity,Microsoft,Excel                           446
Teaching & Academics,Language Learning,English Language       339
IT & Software,IT Certifications,Microsoft Certification       328
Lifestyle,Arts & Crafts,Watercolor Painting                   327
Development,Programming Languages,Python                      309
                                                             ... 
Lifestyle,Other Lifestyle,Sexual Health                         1
Lifestyle,Arts & Crafts,Surface Pattern Design                  1
Esoteric Practices,Qi Gong                                      1
Lifestyle,Food & Beverage,Hotel Management                      1
Health & Fitness,General Health,Medical Device Development      1
Name: Categories, Length: 19755, dtype: int64
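The result of `value_counts()` is itself a Series, sorted from the most to the least frequent value, so it can be indexed like any other Series. A sketch with invented category labels:

```python
import pandas as pd

# Toy column of categories (invented for illustration)
categories = pd.Series(
    ["Design", "Music", "Design", "Business", "Design", "Music"]
)

counts = categories.value_counts()
print(counts)

# The most frequent value comes first
most_common = counts.index[0]
print(most_common, counts[most_common])
```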

With the argument `normalize=True`, the counts are converted into relative frequencies, i.e. each value's share of all rows.

In [31]:
data['Difficulty'].value_counts(normalize=True)

0    0.856734
1    0.124713
2    0.018553
Name: Difficulty, dtype: float64

In [33]:
data['Original rating'].value_counts(normalize=True)

4.500000    0.098946
4.400000    0.091164
4.600000    0.087430
4.300000    0.079737
4.700000    0.071450
4.200000    0.068865
4.100000    0.055420
5.000000    0.053023
4.000000    0.051838
4.800000    0.046969
3.900000    0.039389
3.800000    0.032389
3.700000    0.026852
4.900000    0.025780
3.600000    0.020937
3.500000    0.018351
4.323447    0.017380
4.366739    0.013294
3.400000    0.011490
3.300000    0.010204
4.446485    0.009182
3.200000    0.007315
4.410997    0.006697
3.000000    0.006609
4.187823    0.006458
4.263276    0.006041
3.100000    0.005814
4.177775    0.005070
2.900000    0.003746
4.448509    0.002977
2.800000    0.002510
2.700000    0.002005
4.273280    0.001816
4.306209    0.001640
2.500000    0.001614
2.600000    0.001463
4.241297    0.001438
1.000000    0.001425
2.400000    0.001110
2.000000    0.001047
2.200000    0.000769
2.300000    0.000744
1.500000    0.000366
1.800000    0.000315
2.100000    0.000240
1.900000    0.000227
1.700000    0.000164
1.600000    0
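Relative frequencies add up to 1 and are often easier to report as percentages. A sketch with four invented values:

```python
import pandas as pd

# Three courses at level 0, one at level 1 (invented for illustration)
difficulty = pd.Series([0, 0, 0, 1])

shares = difficulty.value_counts(normalize=True)
print(shares)

# Convert the shares to percentages with one decimal place
percent = (shares * 100).round(1)
print(percent)
```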


### Sorting DataFrames

DataFrames can be sorted by one or more columns.



In [35]:
data.sort_values(by="Original rating")

Unnamed: 0,Course Name,Course URL,Categories,Short Description,Long Description,Difficulty,Duration,Free Option,Rating,Original rating,Numberofrated,Numberofenroll,Paid Option,Language,Subtitle Language,Platform,Provider,Image URL
15340,Strategic Real Estate Investing for Instant Pr...,https://www.udemy.com/course/strategic-real-es...,"Business,Real Estate,Real Estate Investing","Learn the Secrets to Finding, Fixing, and Flip...",,0,0,0,1.002120e-07,0.5,1,71,$14.99,English,English(default),1,"Roberta Eastman, Keith Boley",
15314,Find Jobs Working From Home + Learn Data Entry...,https://www.udemy.com/course/how-to-find-a-job...,"Business,Entrepreneurship,Freelancing",This course is a complete guideline to work fr...,,0,0,0,1.002120e-07,0.5,1,6,$14.99,English,English,1,Nida Tanveer,
56083,Pressure Measurement and Control Fundamental QA,https://www.udemy.com/course/pressure-measurem...,"Teaching & Academics,Engineering,Control Engin...",Fundamental to Advance MCQ Pressure Transmitte...,,0,0,0,1.002120e-07,0.5,1,2,€ 14.99,English,English(default),1,Mahendra Singh,
56011,Simulation Model in Operations Research,https://www.udemy.com/course/simulation-model-...,"Teaching & Academics,Other Teaching & Academic...",Monte Carlo Simulation,,0,0,0,2.004240e-07,1.0,1,4,€ 14.99,English,English(default),1,Dr.Himanshu Saxena,
43524,Entertain with Ventriloquism,https://www.udemy.com/course/beginner-ventrilo...,"Lifestyle,Arts & Crafts,Magic Trick",The basics for talking like a dummy,,0,0,0,2.004240e-07,1.0,1,10,NT$390,English,English(default),1,Michael Stelzer Ph.D.,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26685,Emulates a Drum set with a Cajon & Shakers (Fu...,https://www.udemy.com/course/emulates-a-drum-s...,"Music,Instruments,Percussion Instruction",Learn pecussion and develop your rhythm,,0,0,0,1.002120e-06,5.0,1,5,€ 12.99,English,English(default),1,Carles Planells,
53124,English Pronunciation: Master the American accent,https://www.udemy.com/course/english-pronuncia...,"Teaching & Academics,Language Learning,English...",How To Speak English Clearly and Correctly. Am...,,0,0,0,1.503180e-05,5.0,15,37,€ 14.99,English,English,1,LinkAge Academy,
26684,DJ - How To Be A Tech House DJ And Play At Fes...,https://www.udemy.com/course/dj-how-to-be-a-te...,"Music,Other Music,DJ",Learn How To Be A Tech House DJ And Play At Fe...,,0,0,0,1.002120e-06,5.0,1,3,€ 12.99,English,English,1,Omar Meho,
21851,Learn How To Work From Home Efficiently,https://www.udemy.com/course/learn-how-to-work...,"Personal Development,Personal Productivity,Fre...",Stay efficient while working from home,,0,0,0,1.002120e-06,5.0,1,7,€ 14.99,English,English,1,Olga Pogozheva,


In [38]:
data.sort_values(by="Numberofenroll", ascending=False)

Unnamed: 0,Course Name,Course URL,Categories,Short Description,Long Description,Difficulty,Duration,Free Option,Rating,Original rating,Numberofrated,Numberofenroll,Paid Option,Language,Subtitle Language,Platform,Provider,Image URL
62720,Automate the Boring Stuff with Python Programming,https://www.udemy.com/course/automate/,"Development,Programming Languages,Python",A practical programming course for office work...,,0,0,0,0.086201,4.600000,93499,997885,NT$470,English,"English,French",1,Al Sweigart,
36739,Microsoft Excel - Excel from Beginner to Advanced,https://www.udemy.com/course/microsoft-excel-2...,"Microsoft,Excel",Excel with this A-Z Microsoft Excel Course. Mi...,,0,1,0,0.233447,4.600000,253210,813030,$23.99,English,"English,French",1,"Kyle Pew, Office Newb",
62713,Machine Learning A-Z™: Hands-On Python & R In ...,https://www.udemy.com/course/machinelearning/,"Development,Data Science,Python",Learn to create Machine Learning Algorithms in...,,0,1,0,0.135611,4.500000,150360,805015,NT$630,English,"English,French",1,"Kirill Eremenko, Hadelin de Ponteves, SuperDat...",
62711,The Web Developer Bootcamp 2021,https://www.udemy.com/course/the-web-developer...,"Development,Web Development",COMPLETELY REDONE - The only course you need t...,,0,2,0,0.201701,4.700000,214122,710430,NT$470,English,"English,French",1,Colt Steele,
57867,The Complete Digital Marketing Course - 12 Cou...,https://www.udemy.com/course/learn-digital-mar...,"Marketing,Digital Marketing","Master Digital Marketing Strategy, Social Medi...",,0,1,0,0.125083,4.500000,138687,615597,$26.99,English,"English,French",1,"Rob Percival, Daragh Walsh, Codestars by Rob P...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57092,Kick off your career as HSE professional : A P...,https://www.udemy.com/course/transform-yoursel...,"Teaching & Academics,Engineering,Workplace Hea...",Occupational Health safety and environmental p...,,0,0,0,0.000000,4.323447,0,0,€ 14.99,English,English(default),1,AF E Learning,
57091,"Three beginners drawing, painting, and collage...",https://www.udemy.com/course/three-beginners-d...,"Teaching & Academics,Online Education,Art for ...",Create three fun pictures. Create a cake colla...,,0,0,0,0.000000,4.323447,0,0,€ 14.99,English,English,1,Julie Walker,
57085,The College Timeline,https://www.udemy.com/course/the-college-timel...,"Teaching & Academics,Other Teaching & Academic...",The College Admissions Process Explained,,0,0,0,0.000000,4.323447,0,0,€ 14.99,English,English,1,Elisia Howard,
57081,Vegetable Calculus,https://www.udemy.com/course/vegetable-calculus/,"Teaching & Academics,Math,Calculus",Calculus taught like never before (using veget...,,0,0,0,0.000000,4.323447,0,0,€ 14.99,English,English,1,Shreya Kelly,


In [41]:
# Sorting by several columns is possible
data.sort_values(by=["Original rating", "Numberofenroll"], ascending=[True, False])

Unnamed: 0,Course Name,Course URL,Categories,Short Description,Long Description,Difficulty,Duration,Free Option,Rating,Original rating,Numberofrated,Numberofenroll,Paid Option,Language,Subtitle Language,Platform,Provider,Image URL
15340,Strategic Real Estate Investing for Instant Pr...,https://www.udemy.com/course/strategic-real-es...,"Business,Real Estate,Real Estate Investing","Learn the Secrets to Finding, Fixing, and Flip...",,0,0,0,1.002120e-07,0.5,1,71,$14.99,English,English(default),1,"Roberta Eastman, Keith Boley",
15314,Find Jobs Working From Home + Learn Data Entry...,https://www.udemy.com/course/how-to-find-a-job...,"Business,Entrepreneurship,Freelancing",This course is a complete guideline to work fr...,,0,0,0,1.002120e-07,0.5,1,6,$14.99,English,English,1,Nida Tanveer,
56083,Pressure Measurement and Control Fundamental QA,https://www.udemy.com/course/pressure-measurem...,"Teaching & Academics,Engineering,Control Engin...",Fundamental to Advance MCQ Pressure Transmitte...,,0,0,0,1.002120e-07,0.5,1,2,€ 14.99,English,English(default),1,Mahendra Singh,
56210,2021 Miller Analogies Test (MAT) TOP Practical...,https://www.udemy.com/course/miller-analogies-...,"Teaching & Academics,Language Learning,College...",97% Pass in First Attempt Easily Most Common M...,,0,0,0,2.004240e-07,1.0,1,3647,€ 14.99,English,English(default),1,SMARTER ACADEMY,
56316,Edexcel IGCSE 4CN2-01 2018 Reading Mock Quiz,https://www.udemy.com/course/edexcel-igcse-4cn...,"Teaching & Academics,Language Learning,Chinese...",A Quick Reference to past Exam Papers IGCSE 4C...,,1,0,0,2.004240e-07,1.0,1,702,€ 14.99,English,English(default),1,David Yao,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77960,The Ultimate Dream Body: Fat Burner Level 1,https://www.udemy.com/course/the-ultimate-drea...,"Health & Fitness,Fitness,Weight Loss",At Home and Without equipements,,0,0,0,1.002120e-06,5.0,1,1,$12.99,English,English(default),1,Monaim Fitness,
77965,Youth Yoga & Mindfulness for Stress and Anxiety,https://www.udemy.com/course/youth-yoga-mindfu...,"Health & Fitness,Yoga",Practical breathing exercises and yoga sequenc...,,0,0,0,1.002120e-06,5.0,1,1,$12.99,English,English,1,Lizandra Deister,
77967,Certified Mind Management Expert,https://www.udemy.com/course/certified-mind-ma...,"Health & Fitness,Mental Health,Meditation",Basics of Mind Management,,0,0,0,1.002120e-06,5.0,1,1,$12.99,English,English,1,Shanmugam Subramonian,
78010,"Diarrhea: Types, Causes, Consequences and Mana...",https://www.udemy.com/course/diarrhea-types-ca...,"Health & Fitness,General Health,Health",All you need to know about Diarrhea and how to...,,0,0,0,1.002120e-06,5.0,1,1,$12.99,English,English,1,Ala Yousef,
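Sorting by several columns is easier to follow on a table small enough to check by hand. In this sketch (invented values) the courses are ordered by rating first, and ties are broken by enrollment, both in descending order.

```python
import pandas as pd

# Toy table with tied ratings (invented for illustration)
df = pd.DataFrame({
    "course": ["A", "B", "C", "D"],
    "rating": [5.0, 4.0, 5.0, 4.0],
    "enrolled": [10, 300, 200, 50],
})

# First key: rating, second key: enrolled; both descending
ranked = df.sort_values(by=["rating", "enrolled"], ascending=[False, False])
print(ranked)
```

`sort_values()` returns a new DataFrame and leaves `df` in its original order.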


### Indexing and Retrieving Data

The values of a column can be accessed with `<dataframe>['<column name>']`.

In [43]:
data['Original rating']

0        4.600000
1        4.600000
2        4.600000
3        4.400000
4        4.600000
           ...   
79281    4.446485
79282    4.446485
79283    4.446485
79284    4.446485
79285    4.446485
Name: Original rating, Length: 79286, dtype: float64

Further operations or methods can then be applied to it:

In [46]:
data['Original rating'] + 10

0        14.600000
1        14.600000
2        14.600000
3        14.400000
4        14.600000
           ...    
79281    14.446485
79282    14.446485
79283    14.446485
79284    14.446485
79285    14.446485
Name: Original rating, Length: 79286, dtype: float64

In [47]:
data['Original rating'].mean()

4.2727264478795
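Operations also work between two columns, row by row, and the result can be stored as a new column. The following sketch uses invented numbers and a hypothetical column name, `share_rated`, to compute the share of enrolled people who left a rating.

```python
import pandas as pd

# Toy table (invented for illustration)
df = pd.DataFrame({"rated": [10, 50, 0], "enrolled": [100, 200, 50]})

# Element-wise division of two columns, stored as a new column
df["share_rated"] = df["rated"] / df["enrolled"]
print(df)

# Aggregations work on the new column as well
print(df["share_rated"].max())
```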

Multiple columns are selected by passing a list of column names:

In [48]:
data[['Course Name', 'Categories', 'Original rating']]

| | Course Name | Categories | Original rating |
|---|---|---|---|
| 0 | The Ultimate Drawing Course - Beginner to Adva... | Graphic Design & Illustration,Drawing | 4.600000 |
| 1 | Character Art School: Complete Character Drawi... | Design,Other Design,Character Design | 4.600000 |
| 2 | Complete Blender Creator: Learn 3D Modelling f... | Design,3D & Animation,Blender | 4.600000 |
| 3 | Design Thinking in 3 Steps | Design,User Experience Design,Design Thinking | 4.400000 |
| 4 | User Experience Design Essentials - Adobe XD U... | Design,User Experience Design,User Interface | 4.600000 |
| ... | ... | ... | ... |
| 79281 | Limit Sugar Sweetened Beverages to Not More Th... | Health & Fitness,Nutrition | 4.446485 |
| 79282 | Limit Processed Foods to Not More Than One Ser... | Health & Fitness,Nutrition | 4.446485 |
| 79283 | how to be an expert in the word of bodybuilding | Health & Fitness,Fitness,Health | 4.446485 |
| 79284 | Meditation - The Art of Inner Peace and Happin... | Health & Fitness,Meditation | 4.446485 |
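The difference between single and double brackets is a common stumbling block: a single column name returns a Series, while a list of names returns a DataFrame, even if the list has only one entry. A sketch on an invented table:

```python
import pandas as pd

# Toy table (invented for illustration)
df = pd.DataFrame({
    "course": ["A", "B"],
    "rating": [4.6, 4.4],
    "enrolled": [100, 200],
})

single = df["rating"]              # one name -> Series
subset = df[["course", "rating"]]  # list of names -> DataFrame

print(type(single))
print(type(subset))
```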


### Boolean Indexing

The selected data can also be filtered by passing a condition.


In [50]:
data[data['Original rating'] >= 4.9]

Unnamed: 0,Course Name,Course URL,Categories,Short Description,Long Description,Difficulty,Duration,Free Option,Rating,Original rating,Numberofrated,Numberofenroll,Paid Option,Language,Subtitle Language,Platform,Provider,Image URL
146,The Ultimate 2D Game Character Design & Animat...,https://www.udemy.com/course/the-ultimate-2d-g...,"Design,3D & Animation,Character Design",Learn how to design and animate a character in...,,0,0,0,0.001683,4.9,1714,43065,NT$470,English,"English,Indonesian",1,Jaysen Batchelor,
233,Photoshop CS6 Crash Course,https://www.udemy.com/course/photoshop-cs6-cra...,"Design,Design Tools,Photoshop",Photoshop CS6 will be yours to command in 4 ho...,,0,0,0,0.001025,4.9,1044,11978,NT$470,English,English,1,Jeremy Shuback,
281,صناعة فيديوهات كارتون أنيميشن والتربح منها مثل...,https://www.udemy.com/course/osloop-2d-cartoon...,"Design,3D & Animation,2D Animation",Cartoon Animator 4 OsLoop Animation ٨٠ محاضرة ...,,0,1,0,0.000840,5.0,838,1866,S$59.98,English,English(default),1,"أسلوب OsLoop, Mohamed A.Karim, Pensee Fathallah",
356,WordPress: Create Stunning Wordpress Websites ...,https://www.udemy.com/course/wordpress-beginners/,"Design,Web Design,WordPress",Create an Amazing Professional Wordpress Websi...,,0,0,0,0.000604,4.9,615,33247,NT$470,English,English,1,"Diego Davila, Up Mind Courses",
420,Illustrator CC 2021 MasterClass : Be a Creativ...,https://www.udemy.com/course/illustrator-cc-ma...,"Design,Design Tools,Adobe Illustrator",Master Adobe Illustrator CC from Beginner to C...,,0,1,0,0.000536,4.9,546,2793,NT$470,English,English(default),1,Khalil Ibrahim,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79235,Digital House Calls:,https://www.udemy.com/course/digital-house-calls/,"Health & Fitness,General Health,Healthcare",A Step-By-By Guide for Telemedicine Visits,,0,0,1,0.000001,5.0,1,111,0,English,English,1,"BridgingApps -, Jana Mitchell, Cathy Foreman",
79236,Upper body training. Japanese meta power trai...,https://www.udemy.com/course/upper-body-traini...,"Fitness,Health","very easy training, but highly effective.Teach...",,0,0,1,0.000001,5.0,1,329,0,English,English(default),1,Hina Toriya,
79239,what you need to know about tint prescribing a...,https://www.udemy.com/course/what-you-need-to-...,"Health & Fitness,General Health,Ophthalmology",tint prescribing and dispensing is currently a...,,0,0,1,0.000001,5.0,1,106,0,English,English,1,Ian Jordan,
79240,Occupational Therapy in a Justice-Based Setting,https://www.udemy.com/course/justice-based-occ...,"Health & Fitness,Other Health & Fitness,Occupa...",Creating a Qualitative Coding Method for a Jus...,,0,0,1,0.000001,5.0,1,191,0,English,English,1,Jessica Neff,


In [58]:
data[data['Provider'] == 'Alex Genadinik']

Unnamed: 0,Course Name,Course URL,Categories,Short Description,Long Description,Difficulty,Duration,Free Option,Rating,Original rating,Numberofrated,Numberofenroll,Paid Option,Language,Subtitle Language,Platform,Provider,Image URL
931,YouTube Thumbnail Image Design With Canva (Can...,https://www.udemy.com/course/beautiful-youtube...,"Design,Design Tools,Thumbnail Creation",Grow your YouTube channel with amazing YouTube...,,0,0,0,0.000180,4.5,200,31277,NT$470,English,English,1,Alex Genadinik,
1925,WordPress Plugin Business (No WordPress Plugin...,https://www.udemy.com/course/make-money-start-...,"Design,Web Design,WordPress Plugins",Start a WordPress plugin business! Learn WordP...,,0,0,0,0.000054,4.3,63,10670,NT$470,English,English,1,Alex Genadinik,
6630,How To Write A Business Plan And A Winning Bus...,https://www.udemy.com/course/how-to-write-a-bu...,"Business,Business Strategy,Business Plan",Business plan template with writing examples 2...,,0,0,0,0.001634,4.6,1772,24733,$15.99,English,"English,Italian",1,Alex Genadinik,
6830,Entrepreneurship: How To Start A Business From...,https://www.udemy.com/course/how-to-start-a-bu...,"Business,Entrepreneurship,Business Idea Genera...",Business fundamentals: Strategies to turn your...,,1,1,0,0.000899,4.3,1043,28079,$18.99,English,"English,Italian",1,Alex Genadinik,
6931,"2021 Selling On Amazon: Amazon SEO, Ads, Ecomm...",https://www.udemy.com/course/amazon-seo-ecomme...,"Business,E-Commerce,Selling on Amazon","Dominate Amazon eCommerce sales: Amazon SEO, r...",,0,0,0,0.000777,4.5,861,5017,$17.99,English,English,1,Alex Genadinik,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61070,SEMRush Site Audit + Extended SEMRush Free Trial,https://www.udemy.com/course/semrush-course/,"Marketing,Search Engine Optimization,SEMrush",SEMRush is one of the top SEO industry tools. ...,,0,0,0,0.000005,4.5,6,174,$14.99,English,English,1,Alex Genadinik,
67881,"6 Ways To Make An Android, iPhone App With No ...",https://www.udemy.com/course/how-to-create-a-m...,"Development,Mobile Development,Mobile App Design",Create a mobile app (Android or iPhone) withou...,,0,0,0,0.000070,4.3,81,3625,NT$470,English,English,1,Alex Genadinik,
69149,2-Second Website Speed Optimization In 1 Day -...,https://www.udemy.com/course/improve-page-load...,"Development,Web Development,Conversion Rate Op...",Faster website Technical SEO tactics like webs...,,1,0,0,0.000034,3.4,50,317,NT$470,English,English,1,Alex Genadinik,
69979,Entrepreneurship For Engineers: Master Busines...,https://www.udemy.com/course/entrepreneurship-...,"Development,Web Development,Software Engineering","Techie, programmer, coder, IT or Computer Scie...",,0,0,0,0.000033,4.5,37,323,NT$470,English,English,1,Alex Genadinik,
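A condition such as `data['Original rating'] >= 4.9` produces a Series of `True`/`False` values, and several conditions can be combined with `&` (and) or `|` (or). Each condition needs its own parentheses. The sketch below uses invented values and keeps only highly rated courses that also have at least 100 ratings; the threshold of 100 is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy table (invented for illustration)
df = pd.DataFrame({
    "course": ["A", "B", "C", "D"],
    "rating": [4.9, 5.0, 4.2, 4.95],
    "rated": [1500, 1, 800, 40],
})

# A condition yields a boolean Series, one value per row
mask = df["rating"] >= 4.9
top = df[mask]

# Combine two conditions; note the parentheses around each one
top_reliable = df[(df["rating"] >= 4.9) & (df["rated"] >= 100)]

print(top)
print(top_reliable)
```

This matters for interpretation: in the output above, many courses with a rating of 5.0 have only a single rating.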


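What is the average rating of courses with fewer than 100 enrolled people? The cells below first look at the distribution of `Numberofenroll` and then filter for courses below the first quartile (75 enrollments).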

In [59]:
data['Numberofenroll'].describe()

count     79286.000000
mean       4650.728956
std       17653.776206
min           0.000000
25%          75.000000
50%         533.000000
75%        2985.750000
max      997885.000000
Name: Numberofenroll, dtype: float64

In [63]:
data[data['Numberofenroll'] < 75]

Unnamed: 0,Course Name,Course URL,Categories,Short Description,Long Description,Difficulty,Duration,Free Option,Rating,Original rating,Numberofrated,Numberofenroll,Paid Option,Language,Subtitle Language,Platform,Provider,Image URL
2808,Interactive Prototyping With Axure RP 8: Core ...,https://www.udemy.com/course/interactive-proto...,"Design,Design Tools,Axure RP",Learn core beginner and intermediate Axure ski...,,0,0,0,0.000028,4.700000,30,61,"NT$5,690",English,English,1,Debbie Levitt,
2818,Adobe Illustrator Beginner to Pro: Learn in an...,https://www.udemy.com/course/adobe-illustrator...,"Design,Design Tools,Adobe Illustrator",Adobe Illustrator introduction for complete be...,,0,0,0,0.000028,4.900000,29,65,NT$470,English,English,1,Rustam Khan,
3035,Pyware 3D Java Drill Design Beginner Tutorial ...,https://www.udemy.com/course/pyware-3d-java-dr...,"Other Design,Music Career",Setting you up for success with the basics of ...,,0,0,0,0.000024,4.800000,25,65,NT$470,English,English,1,Joseph Huls,
3070,Create Fall Guys Characters with Blender,https://www.udemy.com/course/create-fall-guys-...,"Design,Game Design,Character Design","Use Blender to 3D model, texture, rig, and ani...",,1,0,0,0.000022,4.500000,24,45,NT$470,English,English,1,Ali Bashir Salih,
3103,Rigorous Color Theory for Artists,https://www.udemy.com/course/color-theory-d/,"Other Design,Color Theory",Ability of effective color management in paint...,,1,0,0,0.000018,3.800000,24,67,NT$590,English,English,1,Qiang Huang,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78739,80s Butt Thigh and Core Workout in the Water,https://www.udemy.com/course/80s-butt-thigh-an...,"Health & Fitness,Fitness",Water Workout in the Water Butt Thigh and Core,,0,0,0,0.000000,4.446485,0,55,$12.99,English,English(default),1,Laurie Active,
79244,Healing After a Miscarriage,https://www.udemy.com/course/healing-after-a-m...,"Health & Fitness,Mental Health,Spiritual Healing",Quality Support to Help You Through the Trauma...,,0,0,1,0.000000,4.446485,0,63,0,English,English,1,Bailey Gaddis,
79258,First do no harm; Baroness Cumberledge Report ...,https://www.udemy.com/course/first-do-no-harm-...,"Health & Fitness,General Health,Healthcare",Sodium Valproate,,0,0,1,0.000000,4.446485,0,67,0,English,English,1,Deborah Casey,
79270,HIIT Home Training for Weight Loss - All you n...,https://www.udemy.com/course/hiit-home-trainin...,"Health & Fitness,Fitness,HIIT","Gain Strength, Lose Weight, Get Fit, Feel Great!",,0,0,1,0.000000,4.446485,0,40,0,English,English,1,Ollie Clark,
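
The filter above uses the 25% quartile (75) as its threshold, while the question asks about 100, and the mean itself is not computed yet. A minimal sketch of the missing step, run on a small made-up DataFrame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for the course data; the numbers are made up for illustration.
toy = pd.DataFrame({
    'Numberofenroll': [61, 65, 45, 28079, 5017],
    'Original rating': [4.7, 4.9, 4.5, 4.3, 4.5],
})

# Boolean mask: True for courses with fewer than 100 enrolled people
small_courses = toy['Numberofenroll'] < 100

# Pick the rating column for those rows and average it
mean_rating_small = toy.loc[small_courses, 'Original rating'].mean()
print(round(mean_rating_small, 2))
```

On the real data the same two steps would presumably read `data.loc[data['Numberofenroll'] < 100, 'Original rating'].mean()`. In the output above, rows with `Numberofrated` equal to 0 all show the same `Original rating` of 4.446485, which looks like a filled-in default. Whether to exclude those rows before averaging is a judgment call.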


### Research questions

- Which course categories are most popular?
- Which languages are available?
- Are there top teachers?
- How does the price affect the number of students?
- How many beginner, intermediate, and expert courses are there? What share does each level have?
- Does the course length affect the rating?

### Research questions

- Which course categories are most popular?
- Are there top teachers?
- How does the price affect the number of students?
- Does the course length affect the rating?

How many beginner, intermediate, and expert courses are there? What share does each level have?

In [78]:
data['Difficulty'].value_counts()

0    67927
1     9888
2     1471
Name: Difficulty, dtype: int64

In [79]:
data['Difficulty'].value_counts(normalize=True)

0    0.856734
1    0.124713
2    0.018553
Name: Difficulty, dtype: float64
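
Counts and shares can also be placed side by side in one table. The following is only a sketch on made-up values: the Series `difficulty` and the column names `count` and `share` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up difficulty codes, standing in for data['Difficulty']
difficulty = pd.Series([0, 0, 0, 1, 2, 0, 1, 0])

counts = difficulty.value_counts()                 # absolute frequencies
shares = difficulty.value_counts(normalize=True)   # relative frequencies, summing to 1

# Both Series share the same index (the difficulty codes), so they align by code
summary = pd.DataFrame({'count': counts, 'share': shares})
print(summary)
```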

Which languages are available?

In [65]:
data['Language'].value_counts()

English    79286
Name: Language, dtype: int64

--> What does this mean for the interpretability of the data?

Languages of the available subtitles?

In [76]:
data['Subtitle Language'].value_counts()

English                        63232
English(default)               13444
English,French                   941
English,Italian                  361
English,Indonesian               344
English,Spanish                  196
English,Portuguese               172
English,Arabic                   172
English,Polish                   144
English,German                    63
English,Turkish                   36
Italian                           25
English,Afrikaans                 20
English,Dutch                     14
English,Russian                   14
English,Simplified,Chinese        12
English,English                   11
English,Japanese                  10
English,Hindi                      8
French                             6
English,Greek                      6
English,Bengali                    5
Indonesian                         4
Spanish                            4
English,Korean                     3
Arabic,Danish                      3
German                             3
E

### Data Cleaning

Convert `English(default)` to `English` using a function.

In [82]:
def clean_string(string):
    return string.replace('(default)', '')

In [83]:
clean_string('English(default)')

'English'

In [84]:
clean_string('Französisch (default)')

'Französisch '
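
As the second call shows, a space in front of `(default)` is left behind in the result. One possible variant also trims surrounding whitespace. The name `clean_string_stripped` is used here only for illustration:

```python
def clean_string_stripped(string):
    # Remove the '(default)' marker, then trim leftover whitespace at both ends
    return string.replace('(default)', '').strip()

print(repr(clean_string_stripped('English(default)')))
print(repr(clean_string_stripped('Französisch (default)')))
```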

A function can be applied to every row of a DataFrame column with `<dataframe>[<column_name>].apply()`.

In [86]:
data['Subtitle Language'] = data['Subtitle Language'].apply(clean_string)
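
For simple text replacements, pandas also offers the vectorized `.str.replace()` accessor, which should give the same result as `.apply(clean_string)`. A small sketch on a made-up column (`toy` is an illustrative assumption):

```python
import pandas as pd

toy = pd.DataFrame({
    'Subtitle Language': ['English(default)', 'English,French', 'English'],
})

def clean_string(string):
    return string.replace('(default)', '')

# Variant 1: call the Python function once per row
via_apply = toy['Subtitle Language'].apply(clean_string)

# Variant 2: vectorized string method;
# regex=False treats '(default)' as literal text rather than a regex group
via_str = toy['Subtitle Language'].str.replace('(default)', '', regex=False)

print(via_apply.equals(via_str))
```

One practical difference: `.str.replace()` passes missing values (`NaN`) through unchanged, whereas `clean_string` would raise an error on them because a float has no `replace` method.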

In [87]:
data['Subtitle Language'].value_counts()

English                        76676
English,French                   941
English,Italian                  361
English,Indonesian               344
English,Spanish                  196
English,Portuguese               172
English,Arabic                   172
English,Polish                   144
English,German                    63
English,Turkish                   36
Italian                           25
English,Afrikaans                 20
English,Dutch                     14
English,Russian                   14
English,Simplified,Chinese        12
English,English                   11
English,Japanese                  10
English,Hindi                      8
English,Greek                      6
French                             6
English,Bengali                    5
Indonesian                         4
Spanish                            4
English,Hebrew                     3
German                             3
French,Indonesian                  3
Arabic,Danish                      3
E

In [88]:
subtitles = data['Subtitle Language'].str.split(',')

subtitle_counts = {}

for course in subtitles:
    for language in course:
        if language not in subtitle_counts.keys():
            subtitle_counts[language] = 1
        else:
            subtitle_counts[language] = subtitle_counts[language] + 1
            
print(subtitle_counts)

{'English': 79233, 'French': 956, 'Italian': 391, 'Indonesian': 352, 'Portuguese': 179, 'Spanish': 204, 'Polish': 145, 'Arabic': 175, 'German': 69, 'Turkish': 37, 'Russian': 14, 'Danish': 4, 'Romanian': 2, 'Simplified': 12, 'Chinese': 14, 'Korean': 3, 'Traditional': 2, 'Hindi': 8, 'Dutch': 16, 'Japanese': 10, 'Afrikaans': 20, 'Galician': 1, 'Greek': 6, 'Hebrew': 3, 'Bengali': 5, 'Macedonian': 1, 'Croatian': 1, 'Bulgarian': 1, 'Filipino': 2}
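
The counting loop can be written more compactly with `collections.Counter` from the standard library. This is an alternative sketch on made-up values, not the method used in the course. The Series `toy` is an assumption.

```python
from collections import Counter

import pandas as pd

# Made-up subtitle strings, standing in for data['Subtitle Language']
toy = pd.Series(['English', 'English,French', 'English,Simplified,Chinese'])

subtitle_counter = Counter()
for languages in toy.str.split(','):
    # update() adds 1 for every element of the list
    subtitle_counter.update(languages)

# most_common() returns (language, count) pairs, highest count first
print(subtitle_counter.most_common())
```

Note that splitting on commas also breaks apart entries such as `English,Simplified,Chinese`, which is why `Simplified` and `Traditional` appear as separate keys in the dictionary printed above.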


In [92]:
sorted(subtitle_counts.items(), key=lambda kv: kv[1])

[('Galician', 1),
 ('Macedonian', 1),
 ('Croatian', 1),
 ('Bulgarian', 1),
 ('Romanian', 2),
 ('Traditional', 2),
 ('Filipino', 2),
 ('Korean', 3),
 ('Hebrew', 3),
 ('Danish', 4),
 ('Bengali', 5),
 ('Greek', 6),
 ('Hindi', 8),
 ('Japanese', 10),
 ('Simplified', 12),
 ('Russian', 14),
 ('Chinese', 14),
 ('Dutch', 16),
 ('Afrikaans', 20),
 ('Turkish', 37),
 ('German', 69),
 ('Polish', 145),
 ('Arabic', 175),
 ('Portuguese', 179),
 ('Spanish', 204),
 ('Indonesian', 352),
 ('Italian', 391),
 ('French', 956),
 ('English', 79233)]


How popular are very long courses?

Column `Duration`, categorization rule:

    - 0 - less than 10 hours
    - 1 - between 10 and 50 hours
    - 2 - more than 50 hours

In [93]:
# Number of courses by length
data['Duration'].value_counts()

0    70903
1     8071
2      312
Name: Duration, dtype: int64

In [95]:
# Total number of people enrolled in courses longer than 50 hours
data[data['Duration'] == 2]
data[data['Duration'] == 2]['Numberofenroll'].sum()

8259924

In [101]:
# Average rating
data[data['Duration'] == 2]['Original rating'].mean()

4.3627561737179485
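
The total number of enrollments depends heavily on how many courses fall into each category (312 versus 70903 in the counts above), so per-course averages are easier to compare. A hedged sketch with `groupby` on made-up numbers follows. `toy` and the result column names are illustrative assumptions.

```python
import pandas as pd

# Made-up rows, standing in for the course data
toy = pd.DataFrame({
    'Duration': [0, 0, 1, 2, 2],
    'Numberofenroll': [100, 300, 1000, 5000, 7000],
    'Original rating': [4.0, 4.4, 4.5, 4.2, 4.6],
})

# One row per duration category, with several summary statistics at once
by_duration = toy.groupby('Duration').agg(
    courses=('Numberofenroll', 'count'),
    total_enrolled=('Numberofenroll', 'sum'),
    mean_enrolled=('Numberofenroll', 'mean'),
    mean_rating=('Original rating', 'mean'),
)
print(by_duration)
```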

Which course categories are most popular?

# Time for feedback



Link: https://ahaslides.com/QOCLW

![Feedback QR Code](../imgs/qrcode_vl5.png)

