### What are the different ways data can be categorized?
- Structured data: tabular format
- Unstructured data: images, videos, audio files
- Semi-structured data: JSON, XML files, which have their own pre-defined structure
---
- Human generated: usually stored in databases; generated through the interaction of users with systems
- Machine generated: produced automatically by systems without direct human input, e.g. logs and metadata (data about the data)
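
As a rough illustration of the structured, semi-structured, and unstructured categories above, the sketch below holds a small sample in each form. The records, field names, and values are invented for the example.

```python
import json

# Structured: every row follows the same fixed columns.
columns = ("id", "name", "age")
table = [(1, "Ana", 31), (2, "Ben", 27)]
ages = [row[columns.index("age")] for row in table]

# Semi-structured: JSON carries its own keys and nesting, and records may differ.
json_text = '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "profile": {"city": "Oslo"}}]'
records = json.loads(json_text)
keys_per_record = [sorted(record.keys()) for record in records]

# Unstructured: raw bytes expose no fields at all until they are decoded.
blob = bytes([137, 80, 78, 71])  # first four bytes of the PNG file signature

print(ages)
print(keys_per_record)
print(len(blob))
```

The table can be queried by column straight away, the JSON needs its keys inspected per record, and the bytes need a decoder before anything can be said about their content.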

### What are some sources of data?
- Relational and non-relational databases are common sources of enterprise data
- Data warehouses store consolidated data from multiple sources for analytics and reporting
- Public datasets from online repositories such as Kaggle
- Web scraping: extracting content / data from websites
- Sensors that capture data from the real world
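
A minimal sketch of reading from a relational source, assuming an in-memory SQLite database as a stand-in for an enterprise database. The `orders` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for a relational data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Ana", 20.0), ("Ben", 35.5), ("Ana", 4.5)],
)

# Pull aggregated data out with SQL.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

With pandas installed, the same query could be loaded into a DataFrame through `pd.read_sql_query` while the connection is still open.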

### How many data types are there?
- Primitive data types:
  - integer, character, float, double, boolean
- Non-primitive data types: more complex
  - arrays, lists, binary trees, hash maps, strings, etc.
---
- Numerical (Quantitative): discrete, continuous
- Categorical (Qualitative): ordinal (ordered categories), nominal (unordered categories, including binary)
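
One possible way to represent these statistical types in pandas is sketched below. The column names and values are made up for illustration.

```python
import pandas as pd

# Numerical: discrete counts (integers) and continuous measurements (floats).
visits = pd.Series([3, 0, 7])            # discrete
height_cm = pd.Series([172.4, 180.1])    # continuous

# Ordinal: categories with a meaningful order.
size = pd.Series(pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
))
above_low = (size > "low").tolist()

# Nominal: categories with no inherent order.
color = pd.Series(["red", "green", "red"], dtype="category")

print(visits.dtype, height_cm.dtype)
print(size.min(), size.max(), above_low)
print(color.cat.ordered)
```

Declaring the order explicitly is what allows comparisons and `min`/`max` on the ordinal column; the nominal column only supports equality checks and counting.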

### What is the difference between structured, unstructured and semi-structured data? Mention some examples
- Structured data can be processed directly by most algorithms: batch processing, real-time processing, machine learning models, and more. (Ex.: DataFrames, relational database tables)
- Unstructured data must be preprocessed before it is fed into most algorithms, because they cannot interpret its raw content directly. (Ex.: images, video, audio)
- Semi-structured data has some structure, but usually still needs preprocessing. (Ex.: JSON, XML)
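
The preprocessing step for semi-structured data can be sketched with `pd.json_normalize`, which flattens nested JSON records into a table. The records below are a hypothetical API response, not data from this notebook.

```python
import pandas as pd

# Hypothetical nested records; the second one has no "city" field.
records = [
    {"id": 1, "user": {"name": "Ana", "city": "Oslo"}},
    {"id": 2, "user": {"name": "Ben"}},
]

# Nested keys become dotted column names; missing fields become NaN.
flat = pd.json_normalize(records)

print(sorted(flat.columns))
print(flat["user.city"].isna().tolist())
```

Once flattened, the result is an ordinary DataFrame and can be handled like any other structured data.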

In [1]:
# Reverse the digits of each number, then add them: 31 + 41
n1, n2 = 13, 14
int(str(n1)[::-1]) + int(str(n2)[::-1])

72

In [2]:
import pandas as pd

# Example DataFrame
data = {'A': [1,2,3,4], 'B':[5,6,7,8]}
df = pd.DataFrame(data)
df

,A,B
0,1,5
1,2,6
2,3,7
3,4,8


In [3]:
# Using map on a Series
df['A'] = df['A'].map({1: 'one', 2: 'two', 3: 'three', 4: 'four'})  # map replaces values in a single column (Series)
df

,A,B
0,one,5
1,two,6
2,three,7
3,four,8


In [4]:
df = pd.DataFrame(data)
# applymap applies the function to every element of the DataFrame
# (deprecated since pandas 2.1 in favor of DataFrame.map)
df = df.applymap(lambda x: x * 2)
df

,A,B
0,2,10
1,4,12
2,6,14
3,8,16


In [5]:
df['A'] = df['A'].apply(lambda x: x * 2)  # apply the function to a single column only
df

,A,B
0,4,10
1,8,12
2,12,14
3,16,16


In [6]:
people = [{'first':'Jay','last':'Patel','email':'jay123@hotmail.com'},
          {'first':'Aanal','last':'Patel','email':'aanal123@femail.com'},
          {'first':'Bimal','last':'Stha','email':'bimal@hotmail.com'},
          {'first':'Danilo','last':'Kabita','email':'bro@ohnou.com'}]
people_df = pd.DataFrame(people)

people_df['email'].apply(len)

0    18
1    19
2    17
3    13
Name: email, dtype: int64

In [7]:
def update_email(email):
    return email.upper()

In [8]:
people_df['email'] = people_df['email'].apply(update_email)
people_df

,first,last,email
0,Jay,Patel,JAY123@HOTMAIL.COM
1,Aanal,Patel,AANAL123@FEMAIL.COM
2,Bimal,Stha,BIMAL@HOTMAIL.COM
3,Danilo,Kabita,BRO@OHNOU.COM


In [9]:
people_df['full_name'] = people_df['first'] + ' ' + people_df['last']
people_df

,first,last,email,full_name
0,Jay,Patel,JAY123@HOTMAIL.COM,Jay Patel
1,Aanal,Patel,AANAL123@FEMAIL.COM,Aanal Patel
2,Bimal,Stha,BIMAL@HOTMAIL.COM,Bimal Stha
3,Danilo,Kabita,BRO@OHNOU.COM,Danilo Kabita


In [10]:
# Returns a new DataFrame; people_df itself is unchanged
people_df.drop(columns=['first', 'last'])

,email,full_name
0,JAY123@HOTMAIL.COM,Jay Patel
1,AANAL123@FEMAIL.COM,Aanal Patel
2,BIMAL@HOTMAIL.COM,Bimal Stha
3,BRO@OHNOU.COM,Danilo Kabita


In [11]:
# inplace=True modifies people_df directly and returns None
people_df.drop(columns=['first', 'last'], inplace=True)
people_df

,email,full_name
0,JAY123@HOTMAIL.COM,Jay Patel
1,AANAL123@FEMAIL.COM,Aanal Patel
2,BIMAL@HOTMAIL.COM,Bimal Stha
3,BRO@OHNOU.COM,Danilo Kabita


In [13]:
people_df['full_name'].str.split(' ')

0        [Jay, Patel]
1      [Aanal, Patel]
2       [Bimal, Stha]
3    [Danilo, Kabita]
Name: full_name, dtype: object

In [25]:
people_df[['first', 'last']] = people_df['full_name'].str.split(' ', expand = True)
people_df

,email,full_name,first,last
0,JAY123@HOTMAIL.COM,Jay Patel,Jay,Patel
1,AANAL123@FEMAIL.COM,Aanal Patel,Aanal,Patel
2,BIMAL@HOTMAIL.COM,Bimal Stha,Bimal,Stha
3,BRO@OHNOU.COM,Danilo Kabita,Danilo,Kabita


In [26]:
people_df = pd.concat([people_df, pd.DataFrame(data=[{'first':'Tony'}])], ignore_index=True)

In [27]:
people_df

,email,full_name,first,last
0,JAY123@HOTMAIL.COM,Jay Patel,Jay,Patel
1,AANAL123@FEMAIL.COM,Aanal Patel,Aanal,Patel
2,BIMAL@HOTMAIL.COM,Bimal Stha,Bimal,Stha
3,BRO@OHNOU.COM,Danilo Kabita,Danilo,Kabita
4,,,Tony,


In [29]:
people_df = people_df.drop(index=4)
people_df

,email,full_name,first,last
0,JAY123@HOTMAIL.COM,Jay Patel,Jay,Patel
1,AANAL123@FEMAIL.COM,Aanal Patel,Aanal,Patel
2,BIMAL@HOTMAIL.COM,Bimal Stha,Bimal,Stha
3,BRO@OHNOU.COM,Danilo Kabita,Danilo,Kabita


In [31]:
people_df.sort_values(by='last', ascending=False)

,email,full_name,first,last
2,BIMAL@HOTMAIL.COM,Bimal Stha,Bimal,Stha
0,JAY123@HOTMAIL.COM,Jay Patel,Jay,Patel
1,AANAL123@FEMAIL.COM,Aanal Patel,Aanal,Patel
3,BRO@OHNOU.COM,Danilo Kabita,Danilo,Kabita


In [33]:
people_df.sort_values(by=['last', 'first'], ascending=[False, True])

,email,full_name,first,last
2,BIMAL@HOTMAIL.COM,Bimal Stha,Bimal,Stha
1,AANAL123@FEMAIL.COM,Aanal Patel,Aanal,Patel
0,JAY123@HOTMAIL.COM,Jay Patel,Jay,Patel
3,BRO@OHNOU.COM,Danilo Kabita,Danilo,Kabita


In [34]:
df = pd.read_csv('survey_results_public.csv')
df.tail(2)

,ResponseId,Q120,MainBranch,Age,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,ProfessionalTech,Industry,SurveyLength,SurveyEase,ConvertedCompYearly
89182,89183,I agree,I am a developer by profession,Under 18 years old,"Employed, part-time;Student, part-time","Hybrid (some remote, some in-person)",Hobby;School or academic work,"Secondary school (e.g. American high school, G...",Online Courses or Certification;Other online r...,Formal documentation provided by the owner of ...,...,,,,,,,,Appropriate in length,Neither easy nor difficult,
89183,89184,I agree,I am a developer by profession,35-44 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Professional development or self-paced l...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Colleague;Online Courses or Certification;Othe...,Formal documentation provided by the owner of ...,...,Never,1-2 times a week,1-2 times a week,60-120 minutes a day,30-60 minutes a day,DevOps function;Developer portal or other cent...,"Information Services, IT, Software Development...",Appropriate in length,Easy,


In [39]:
df.sort_values(by=['Country', 'ConvertedCompYearly'], ascending=[True, False]).head(5)[['Country', 'ConvertedCompYearly']]

,Country,ConvertedCompYearly
88347,Afghanistan,9203683.0
86038,Afghanistan,68000.0
1334,Afghanistan,2403.0
57361,Afghanistan,2288.0
17872,Afghanistan,1144.0


In [40]:
df['ConvertedCompYearly'].nlargest(5)

53268    74351432.0
77848    73607918.0
66223    72714292.0
28121    57513831.0
19679    36573181.0
Name: ConvertedCompYearly, dtype: float64

In [41]:
df['ConvertedCompYearly'].nsmallest(5)

2243     1.0
3929     1.0
5724     1.0
8862     1.0
13708    1.0
Name: ConvertedCompYearly, dtype: float64

In [42]:
df['ConvertedCompYearly'].mean()

103110.08171765343

In [45]:
df.describe()

,ResponseId,CompTotal,WorkExp,ConvertedCompYearly
count,89184.0,48225.0,43579.0,48019.0
mean,44592.5,1.036807e+42,11.405126,103110.1
std,25745.347541,2.276847e+44,9.051989,681418.8
min,1.0,0.0,0.0,1.0
25%,22296.75,63000.0,5.0,43907.0
50%,44592.5,115000.0,9.0,74963.0
75%,66888.25,230000.0,16.0,121641.0
max,89184.0,5e+46,50.0,74351430.0


In [49]:
df['Employment'].value_counts(dropna=False, normalize=True) * 100

Employment
Employed, full-time                                                         60.266416
Student, full-time                                                           8.331091
Independent contractor, freelancer, or self-employed                         7.934159
Employed, full-time;Independent contractor, freelancer, or self-employed     4.882042
Not employed, but looking for work

In [50]:
df['RemoteWork'].value_counts(dropna=False, normalize=True) * 100

RemoteWork
Hybrid (some remote, some in-person)    34.906485
Remote                                  34.272964
NaN                                     17.238518
In-person                               13.582033
Name: proportion, dtype: float64

In [51]:
df['Country'].value_counts(dropna=False, normalize=True) * 100

Country
United States of America                                20.908459
Germany                                                  8.216720
India                                                    6.307185
United Kingdom of Great Britain and Northern Ireland     6.225332
Canada                                                   3.932320
                                                          ...    
Saint Kitts and Nevis                                    0.001121
Marshall Islands                                         0.001121
Samoa                                                    0.001121
Central African Republic                                 0.001121
San Marino                                               0.001121
Name: proportion, Length: 186, dtype: float64

In [57]:
filt = (df['Country'] == 'Canada')
df.loc[filt, 'RemoteWork'].value_counts(dropna=False, normalize=True) * 100

RemoteWork
Remote                                  46.307385
Hybrid (some remote, some in-person)    31.023667
NaN                                     15.027089
In-person                                7.641859
Name: proportion, dtype: float64