# Pandas Data Types
## pandas.Series
1. A homogeneous (single dtype), one-dimensional data structure
2. Operations and conditions are vectorized, so explicit loops and if statements are rarely needed
3. Uses [] to access elements by index
4. Index labels can be numbers or text

## pandas.DataFrame
1. A two-dimensional data structure
2. A combination of multiple Series that share one index
3. Equivalent to a regular table

In [2]:
import pandas as pd

In [3]:
l1 = [12,34,56,32,12,3,4,57,54,46,56,78,99,100,23]
print(l1)

ser1 = pd.Series(l1)
print(ser1)

[12, 34, 56, 32, 12, 3, 4, 57, 54, 46, 56, 78, 99, 100, 23]
0      12
1      34
2      56
3      32
4      12
5       3
6       4
7      57
8      54
9      46
10     56
11     78
12     99
13    100
14     23
dtype: int64


In [4]:
# for pd.Series, the index can also be user defined
ser2 = pd.Series(l1, index= range(1,len(l1)+1))
# index takes a list-like as input -
#        i. labels are usually unique, but pandas does not require it
#        ii. its length must match the length of the data
print(ser2)

1      12
2      34
3      56
4      32
5      12
6       3
7       4
8      57
9      54
10     46
11     56
12     78
13     99
14    100
15     23
dtype: int64
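
The two index rules can be checked on a small series. This is a minimal sketch; the names `dup`, `rows_for_a` and `length_error` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Duplicate labels are accepted; .loc then returns every matching row.
dup = pd.Series([10, 20, 30], index=["a", "a", "b"])
rows_for_a = dup.loc["a"]          # a Series holding both rows labelled "a"

# A length mismatch between the data and the index raises ValueError.
try:
    pd.Series([10, 20, 30], index=["a", "b"])
    length_error = None
except ValueError as exc:
    length_error = exc

print(rows_for_a.tolist(), dup.index.is_unique, type(length_error).__name__)
```

Unique labels are still the safer choice, since a duplicated label makes `.loc` return a Series instead of a single value.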


In [5]:
ser3 = pd.Series(l1, index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 0, 'j', 23, 2, 'word_idx', 'z', 1])
print(ser3)

a            12
b            34
c            56
d            32
e            12
f             3
g             4
h            57
0            54
j            46
23           56
2            78
word_idx     99
z           100
1            23
dtype: int64


A Series can be indexed in two ways:

1. By position (.iloc[]): integer positions 0 to n-1, where n is the number of elements in the series. Positions always exist, regardless of the labels. If the user does not define an index, the labels are also 0 to n-1, so position and label coincide.
2. By label (.loc[]): the labels are the user-defined index if one was given, while the positions still run from 0 to n-1. .loc looks up labels only, so once a custom index is defined it no longer accepts a position unless that integer is also a label. Hence .loc is preferred for label-based row and column fetching, and .iloc for positional fetching.
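
A short sketch of the difference, using a hypothetical three-element series `s` with text labels (not one of the notebook's series):

```python
import pandas as pd

s = pd.Series([12, 34, 56], index=["a", "b", "c"])

by_position = s.iloc[0]    # first element, whatever its label is
by_label = s.loc["a"]      # element whose label is "a"

# 0 is a valid position but not a label, so .loc raises KeyError.
try:
    s.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_position, by_label, label_zero_found)
```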

In [6]:
ser4 = pd.Series(l1, index=['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o'])
ser4.loc['a']
# ser4.loc[0]       raises KeyError: 0 is a position, not a label of ser4

12

In [7]:
print(l1[3])
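# access the fourth element, i.e. the element at position 3
# ser3 has a mixed index: .iloc[3] is positional, while .loc[0] and ser3[1] look up the labels 0 and 1
print(ser3.iloc[3], ser3.loc[0], ser3[1])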
print(ser3['word_idx'])

32
32 54 23
99


In [8]:
print(ser1.iloc[0:8])   # the right-hand limit (position 8) is excluded. ser1 has no user-defined index
print(ser2.iloc[0:8])   # the right-hand limit (position 8) is excluded. ser2 has a user-defined index
# both give eight rows (positions 0 to 7), because positional indexing is the same in both series. When printed, ser2 displays its user-defined labels.

0    12
1    34
2    56
3    32
4    12
5     3
6     4
7    57
dtype: int64
1    12
2    34
3    56
4    32
5    12
6     3
7     4
8    57
dtype: int64


In [9]:
print(ser2.iloc[0:80])  # out-of-range slice bounds are clipped: only the positions that exist are returned, without an error

1      12
2      34
3      56
4      32
5      12
6       3
7       4
8      57
9      54
10     46
11     56
12     78
13     99
14    100
15     23
dtype: int64


In [10]:
print(ser2.loc[1:9]) # unlike a regular slice (1,2,3,4,5,6,7,8), a slice in .loc includes the end label: 1,2,3,4,5,6,7,8,9
print(ser2.loc[0:90])  # only the rows whose labels fall within the range are returned, without raising an error

1    12
2    34
3    56
4    32
5    12
6     3
7     4
8    57
9    54
dtype: int64
1      12
2      34
3      56
4      32
5      12
6       3
7       4
8      57
9      54
10     46
11     56
12     78
13     99
14    100
15     23
dtype: int64
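
The endpoint behaviour of the two slicers can be compared side by side. The sketch below uses a hypothetical five-element series `s` labelled 1 to 5.

```python
import pandas as pd

s = pd.Series([12, 34, 56, 32, 12], index=range(1, 6))

by_position = s.iloc[1:3]   # positions 1 and 2 (end excluded) -> labels 2 and 3
by_label = s.loc[1:3]       # labels 1, 2 and 3 (end label included)
clipped = s.loc[0:90]       # bounds outside a sorted index are tolerated

print(by_position.tolist(), by_label.tolist(), len(clipped))
```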


#### Applying conditions to series

In [11]:
ser1 > 30

0     False
1      True
2      True
3      True
4     False
5     False
6     False
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14    False
dtype: bool

In [12]:
ser1[ser1 > 30]

1      34
2      56
3      32
7      57
8      54
9      46
10     56
11     78
12     99
13    100
dtype: int64

In [13]:
ser1[(ser1 > 30) & (ser1 < 80)] # comparison happens element-wise on the series; each condition needs its own parentheses

# and: &
# or:  |
# not: ~

1     34
2     56
3     32
7     57
8     54
9     46
10    56
11    78
dtype: int64

In [14]:
ser1.loc[(ser1 > 30) & (ser1 < 80)]
# .loc is the conventional way of applying conditions, because it accepts a boolean Series. .iloc does not accept a boolean Series; it needs a plain boolean array or list instead.

1     34
2     56
3     32
7     57
8     54
9     46
10    56
11    78
dtype: int64
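
A sketch of how the same mask can be used with both indexers, and how `~` inverts it. The series `s` and the variable names are illustrative only; converting the mask with `to_numpy()` is one way to hand `.iloc` a plain boolean array.

```python
import pandas as pd

s = pd.Series([12, 34, 56, 99])
mask = (s > 30) & (s < 80)

with_loc = s.loc[mask]                 # boolean Series, aligned by label
with_iloc = s.iloc[mask.to_numpy()]    # plain boolean array, applied by position
outside = s.loc[~mask]                 # ~ negates the mask

print(with_loc.tolist(), outside.tolist())
```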

### Functions associated with series

In [15]:
print(dir(ser1))

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__array_wrap__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__long__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__'

In [16]:
ser1.sum()

666

In [17]:
ser1.mean() 

44.4

In [18]:
ser1.median()

46.0

In [19]:
ser1.mode() # always returns a Series, even if there is only one mode

0    12
1    56
dtype: int64

In [20]:
ser1.min()

3

In [21]:
ser1.max()

100

In [74]:
ser1.quantile([0,0.25,0.5,0.75,0.65,1]) # accepts any quantile between 0 and 1, not just the quartiles and min & max

0.00      3.0
0.25     17.5
0.50     46.0
0.75     56.5
0.65     56.0
1.00    100.0
dtype: float64
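
The fractional results (17.5, 56.5) come from linear interpolation between neighbouring sorted values, which is the default for `quantile`. The helper `linear_quantile` below is a hypothetical hand-rolled version written only to illustrate that rule.

```python
import pandas as pd

values = [12, 34, 56, 32, 12, 3, 4, 57, 54, 46, 56, 78, 99, 100, 23]
ordered = sorted(values)

def linear_quantile(sorted_values, q):
    """Quantile by linear interpolation between the two nearest ranks."""
    pos = q * (len(sorted_values) - 1)               # fractional position in the sorted data
    lower = int(pos)                                 # rank at or just below that position
    upper = min(lower + 1, len(sorted_values) - 1)   # next rank, capped at the last element
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

by_hand = linear_quantile(ordered, 0.25)        # position 3.5: halfway between 12 and 23
by_pandas = pd.Series(values).quantile(0.25)
print(by_hand, by_pandas)
```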

In [23]:
ser1.describe()

count     15.000000
mean      44.400000
std       31.497846
min        3.000000
25%       17.500000
50%       46.000000
75%       56.500000
max      100.000000
dtype: float64

In [24]:
#  pd.to_numeric(ser1)        convert suitable values to a numeric dtype
#  pd.to_datetime()           convert suitable strings to a datetime dtype
# --------------------------------------------------------------
# dataframe.astype(str)   anything (DataFrame, Series, DataFrame column, ...) to string (object dtype)
# dataframe.astype(bool)  anything (DataFrame, Series, DataFrame column, ...) to bool

In [25]:
ser1.astype(bool)

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
dtype: bool

In [26]:
ser1.astype(int)

0      12
1      34
2      56
3      32
4      12
5       3
6       4
7      57
8      54
9      46
10     56
11     78
12     99
13    100
14     23
dtype: int64

In [27]:
ser1.astype(float)

0      12.0
1      34.0
2      56.0
3      32.0
4      12.0
5       3.0
6       4.0
7      57.0
8      54.0
9      46.0
10     56.0
11     78.0
12     99.0
13    100.0
14     23.0
dtype: float64

In [28]:
ser1.astype(object)

0      12
1      34
2      56
3      32
4      12
5       3
6       4
7      57
8      54
9      46
10     56
11     78
12     99
13    100
14     23
dtype: object

In [29]:
ser1.clip?

Signature:
ser1.clip(
    lower=None,
    upper=None,
    axis=None,
    inplace: 'bool_t' = False,
    *args,
    **kwargs,
) -> 'FrameOrSeries'
Docstring:
Trim values at input threshold(s).

Assigns values outside boundary to boundary values. Thresholds
can be singular values or array like, and in the latter case
the clipping is performed element-wise in the specified axis.

Parameters
----------
lower : float or array_like, default None
    Minimum threshold value. All valu
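
As a small illustration of the docstring, values below `lower` are raised to it and values above `upper` are lowered to it. The series and thresholds here are made up.

```python
import pandas as pd

s = pd.Series([3, 12, 100])
clipped = s.clip(lower=10, upper=60)   # 3 -> 10, 100 -> 60, 12 is left unchanged

print(clipped.tolist())
```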

In [30]:
df = pd.Series([4,4,4,6,8,4,10,15,6,15,15])
df.duplicated()

0     False
1      True
2      True
3     False
4     False
5      True
6     False
7     False
8      True
9      True
10     True
dtype: bool
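
By default `duplicated` marks every repeat after the first occurrence, as the output above shows. The sketch below, on a made-up series, contrasts that default with `keep=False` and with `drop_duplicates`.

```python
import pandas as pd

s = pd.Series([4, 4, 6, 4])

first_kept = s.duplicated()              # default keep='first': the first 4 is not flagged
all_marked = s.duplicated(keep=False)    # flag every member of a duplicate group
unique_values = s.drop_duplicates()      # keep only the first occurrence of each value

print(first_kept.tolist(), all_marked.tolist(), unique_values.tolist())
```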

In [31]:
# head() and tail() return 5 rows by default
print(ser1.head(3))
print(ser1.tail(3))

0    12
1    34
2    56
dtype: int64
12     99
13    100
14     23
dtype: int64


In [32]:
print(ser1.index)
print(ser3.index)
print(ser1.index.tolist())
print(ser3.index.tolist())
print(ser3.tolist())

RangeIndex(start=0, stop=15, step=1)
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 0, 'j', 23, 2, 'word_idx', 'z',
       1],
      dtype='object')
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 0, 'j', 23, 2, 'word_idx', 'z', 1]
[12, 34, 56, 32, 12, 3, 4, 57, 54, 46, 56, 78, 99, 100, 23]


In [33]:
import os

In [34]:
os.getcwd()

'/home/sharan/projects/al/pds'

In [35]:
ser3.to_csv("ser3.csv")

## DataFrames

In [36]:
df = pd.read_csv('./sosurveydataset/survey_results_public.csv')
# here we can pass a second parameter, index_col='Respondent', since that column is eligible to be the index.
# a custom index can also be assigned afterwards with df.index = [..,..,..,..,......]
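
Both ways of setting the index can be tried without the survey file. The sketch below reads a tiny in-memory CSV whose contents and labels (`r1`, `r2`) are invented for illustration.

```python
import io
import pandas as pd

csv_text = "Respondent,Hobbyist,Age\n1,Yes,25\n2,No,31\n"

# Use an existing column as the index while reading.
by_column = pd.read_csv(io.StringIO(csv_text), index_col="Respondent")

# Or read with the default index and assign custom labels afterwards.
plain = pd.read_csv(io.StringIO(csv_text))
plain.index = ["r1", "r2"]

print(by_column.loc[2, "Age"], plain.index.tolist())
```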

In [76]:
pd.set_option('display.max_columns', 12)
pd.set_option('display.max_rows', 10)

In [77]:
df

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,...,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
0,1.0,I am a developer by profession,Yes,,13,Monthly,...,ASP.NET Core,ASP.NET;ASP.NET Core,Just as welcome now as I felt last year,50.0,36,27
1,2.0,I am a developer by profession,No,,19,,...,,,Somewhat more welcome now than last year,,7,4
2,3.0,I code primarily as a hobby,Yes,,15,,...,,,Somewhat more welcome now than last year,,4,
3,4.0,I am a developer by profession,Yes,25.0,18,,...,,,Somewhat less welcome now than last year,40.0,7,4
4,5.0,"I used to be a developer by profession, but no...",Yes,31.0,16,,...,Django;Ruby on Rails,Ruby on Rails,Just as welcome now as I felt last year,,15,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...
64456,64858.0,,Yes,,16,,...,,,,,10,Less than 1 year
64457,64867.0,,Yes,,,,...,,,,,,
64458,64898.0,,Yes,,,,...,,,,,,
64459,64925.0,,Yes,,,,...,Angular;Angular.js;React.js,,,,,


In [39]:
df.info()
# 1. Rows - number of entries with the index range (here 64461 entries, 0 to 64460), then the total number of data columns (61)
# 2. All columns - each followed by
        # the number of non-missing values in that column
        # its data type (object - usually strings)
# 3. Frequency table of the data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 61 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Respondent                    64461 non-null  int64  
 1   MainBranch                    64162 non-null  object 
 2   Hobbyist                      64416 non-null  object 
 3   Age                           45446 non-null  float64
 4   Age1stCode                    57900 non-null  object 
 5   CompFreq                      40069 non-null  object 
 6   CompTotal                     34826 non-null  float64
 7   ConvertedComp                 34756 non-null  float64
 8   Country                       64072 non-null  object 
 9   CurrencyDesc                  45472 non-null  object 
 10  CurrencySymbol                45472 non-null  object 
 11  DatabaseDesireNextYear        44070 non-null  object 
 12  DatabaseWorkedWith            49537 non-null  object 
 13  D

In [40]:
df.columns.tolist() # there is no df.rows attribute; use df.index for the row labels

['Respondent',
 'MainBranch',
 'Hobbyist',
 'Age',
 'Age1stCode',
 'CompFreq',
 'CompTotal',
 'ConvertedComp',
 'Country',
 'CurrencyDesc',
 'CurrencySymbol',
 'DatabaseDesireNextYear',
 'DatabaseWorkedWith',
 'DevType',
 'EdLevel',
 'Employment',
 'Ethnicity',
 'Gender',
 'JobFactors',
 'JobSat',
 'JobSeek',
 'LanguageDesireNextYear',
 'LanguageWorkedWith',
 'MiscTechDesireNextYear',
 'MiscTechWorkedWith',
 'NEWCollabToolsDesireNextYear',
 'NEWCollabToolsWorkedWith',
 'NEWDevOps',
 'NEWDevOpsImpt',
 'NEWEdImpt',
 'NEWJobHunt',
 'NEWJobHuntResearch',
 'NEWLearn',
 'NEWOffTopic',
 'NEWOnboardGood',
 'NEWOtherComms',
 'NEWOvertime',
 'NEWPurchaseResearch',
 'NEWPurpleLink',
 'NEWSOSites',
 'NEWStuck',
 'OpSys',
 'OrgSize',
 'PlatformDesireNextYear',
 'PlatformWorkedWith',
 'PurchaseWhat',
 'Sexuality',
 'SOAccount',
 'SOComm',
 'SOPartFreq',
 'SOVisitFreq',
 'SurveyEase',
 'SurveyLength',
 'Trans',
 'UndergradMajor',
 'WebframeDesireNextYear',
 'WebframeWorkedWith',
 'WelcomeChange',
 'W

In [41]:
df.describe()   # only columns with numerical values are taken into consideration. NaN values are excluded from each column's statistics.

Unnamed: 0,Respondent,Age,CompTotal,ConvertedComp,WorkWeekHrs
count,64461.0,45446.0,34826.0,34756.0,41151.0
mean,32554.079738,30.834111,3.190464e+242,103756.1,40.782174
std,18967.44236,9.585392,inf,226885.3,17.816383
min,1.0,1.0,0.0,0.0,1.0
25%,16116.0,24.0,20000.0,24648.0,40.0
50%,32231.0,29.0,63000.0,54049.0,40.0
75%,49142.0,35.0,125000.0,95000.0,44.0
max,65639.0,279.0,1.1111110000000001e+247,2000000.0,475.0
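
Text columns can be summarised as well by passing `include="all"`. The frame below is a made-up miniature of the survey data, used only to show the extra rows that appear.

```python
import pandas as pd

frame = pd.DataFrame({
    "age": [25.0, 31.0, None],
    "hobbyist": ["Yes", "No", "Yes"],
})

numeric_only = frame.describe()              # numeric columns only; NaN is not counted
everything = frame.describe(include="all")   # adds unique/top/freq for text columns

print(numeric_only.loc["count", "age"])
print(everything.index.tolist())
```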


In [42]:
# for DataFrames, loc and iloc take [rows, columns] inside []
# the row selector comes first; the column selector is optional and defaults to all columns
# [r,c] -> left of the comma - row indexes/conditions, right of the comma - column names

df.loc[0:15,['Respondent', 'Age', 'CompTotal']] # .loc slices include the end label, so this selects rows 0 to 15

Unnamed: 0,Respondent,Age,CompTotal
0,1,,
1,2,,
2,3,,
3,4,25.0,
4,5,31.0,
...,...,...,...
11,12,49.0,1100.0
12,13,53.0,3000.0
13,14,27.0,66000.0
14,15,,
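
A compact sketch of the `[rows, columns]` form on a hypothetical three-row frame, showing that the column part can be left out and that `.loc` and `.iloc` treat the end of a slice differently:

```python
import pandas as pd

frame = pd.DataFrame({"Respondent": [1, 2, 3], "Age": [None, 25.0, 31.0]})

whole_row = frame.loc[1]                  # columns omitted: row 1 as a Series
label_slice = frame.loc[0:1, ["Age"]]     # labels 0 and 1, end label included
position_slice = frame.iloc[0:1, [1]]     # position 0 only, end excluded

print(whole_row["Age"], len(label_slice), len(position_slice))
```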


In [43]:
# Try to extract columns

# 1. Use of the . operator
print(df.YearsCode)
print(df['YearsCode'])  # prefer [] over . because a column named e.g. count would be shadowed by the DataFrame method count
print(df[['Respondent', 'Age', 'CompTotal']])

0         36
1          7
2          4
3          7
4         15
        ... 
64456     10
64457    NaN
64458    NaN
64459    NaN
64460    NaN
Name: YearsCode, Length: 64461, dtype: object
0         36
1          7
2          4
3          7
4         15
        ... 
64456     10
64457    NaN
64458    NaN
64459    NaN
64460    NaN
Name: YearsCode, Length: 64461, dtype: object
       Respondent   Age  CompTotal
0               1   NaN        NaN
1               2   NaN        NaN
2               3   NaN        NaN
3               4  25.0        NaN
4               5  31.0        NaN
...           ...   ...        ...
64456       64858   NaN        NaN
64457       64867   NaN        NaN
64458       64898   NaN        NaN
64459       64925   NaN        NaN
64460       65112   NaN        NaN

[64461 rows x 3 columns]
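
The name clash is easy to reproduce. In this sketch the frame and its `count` column are invented; attribute access returns the method, while brackets return the column.

```python
import pandas as pd

frame = pd.DataFrame({"count": [1, 2], "YearsCode": ["36", "7"]})

via_dot = frame.count          # the DataFrame.count method, not the column
via_brackets = frame["count"]  # the column, as a Series

print(callable(via_dot), via_brackets.tolist())
```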


In [44]:
#  pd.to_numeric(ser1)        any data type to numbers
#  pd.to_datetime()       suitable characters to date type
# --------------------------------------------------------------
# dataframe.astype(str)   anything to string (object type)
# dataframe.astype(bool)  anything to bool

In [45]:
df['Age'] = pd.to_numeric(df.Age)

In [46]:
df['CompTotal'] = df.CompTotal.astype(bool)   # note: NaN converts to True; only 0 converts to False

In [47]:
df.CompTotal

0        True
1        True
2        True
3        True
4        True
         ... 
64456    True
64457    True
64458    True
64459    True
64460    True
Name: CompTotal, Length: 64461, dtype: bool
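
The all-`True` result above follows from how missing values convert: `NaN` is truthy, so only an exact 0 becomes `False`. If the intent is "has a value" or "is positive", a comparison may be a better fit. A sketch on made-up numbers:

```python
import pandas as pd

comp = pd.Series([0.0, 63000.0, float("nan")])

as_bool = comp.astype(bool)   # NaN becomes True
has_value = comp.notna()      # True where a value is present
is_positive = comp > 0        # comparisons with NaN are False

print(as_bool.tolist(), has_value.tolist(), is_positive.tolist())
```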

In [48]:
df.Respondent = df['Respondent'].astype(float)

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 61 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Respondent                    64461 non-null  float64
 1   MainBranch                    64162 non-null  object 
 2   Hobbyist                      64416 non-null  object 
 3   Age                           45446 non-null  float64
 4   Age1stCode                    57900 non-null  object 
 5   CompFreq                      40069 non-null  object 
 6   CompTotal                     64461 non-null  bool   
 7   ConvertedComp                 34756 non-null  float64
 8   Country                       64072 non-null  object 
 9   CurrencyDesc                  45472 non-null  object 
 10  CurrencySymbol                45472 non-null  object 
 11  DatabaseDesireNextYear        44070 non-null  object 
 12  DatabaseWorkedWith            49537 non-null  object 
 13  D

In [50]:
# date
# day:          %d
# dayofweek:    %w
# weekdayabbr:  %a
# weekdayfulln  %A
# monthnum      %m
# monthnameabbr %b
# monthnamefull %B
# yearwocent    %y
# yearwcent     %Y

In [51]:
date1 = "25Feb2018"
date2 = "1/1/18"
date3 = "01/23/1986"
date4 = "Sunday 25 Feb 2018"

In [52]:
pd.to_datetime(date1,format="%d%b%Y")

Timestamp('2018-02-25 00:00:00')

In [53]:
d = pd.to_datetime(date3,format="%m/%d/%Y")
print(d)

1986-01-23 00:00:00


In [54]:
type(d)

pandas._libs.tslibs.timestamps.Timestamp
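
The two strings defined earlier but not parsed (`date2`, `date4`) can be handled with the same format codes. This sketch assumes that `date2` is month/day with a two-digit year and that weekday and month names are in English.

```python
import pandas as pd

date2 = "1/1/18"
date4 = "Sunday 25 Feb 2018"

d2 = pd.to_datetime(date2, format="%m/%d/%y")      # %y: year without century
d4 = pd.to_datetime(date4, format="%A %d %b %Y")   # %A: full weekday name

print(d2, d4, d4.day_name())
```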

##  Corey's

In [119]:
people = {
    'first' : ['Corey', 'Jane', 'John'],
    'Last' : ['Schafer', 'Doe', 'Doe'],
    'email' : ['CoreySchafer@gmail.com', 'JaneDoe@gmail.com', 'JohnDoe@gmail.com']
}

In [120]:
df1 = pd.DataFrame(people)

In [121]:
df1.shape    # gives the dimensions as (rows, columns). NOTE: shape is an attribute, so it takes no parentheses

(3, 3)

In [122]:
df1

Unnamed: 0,first,Last,email
0,Corey,Schafer,CoreySchafer@gmail.com
1,Jane,Doe,JaneDoe@gmail.com
2,John,Doe,JohnDoe@gmail.com


In [123]:
# indexing a DataFrame

df1.loc[0] # or df1.iloc[0]; both give the same output here. A single row is returned as a Series whose index is the column names

first                     Corey
Last                    Schafer
email    CoreySchafer@gmail.com
Name: 0, dtype: object

In [124]:
# if multiple rows are fetched, the result keeps the row labels as its index

df1.iloc[[1,2], 0]
# The ", 0" column part is optional. Multiple rows/cols need a list [], while a single row/col needs only its index.
# This works for both iloc and loc, but remember that loc takes labels only: the column headers are labels, so loc needs 'first' instead of 0.
# Syntax:       .[i]loc[[row[,row,row,...]][,[col[,col,col,...]] ]

1    Jane
2    John
Name: first, dtype: object

In [125]:
df1.columns

Index(['first', 'Last', 'email'], dtype='object')

In [126]:
df['Hobbyist'].value_counts()   # gives the frequency of the unique values 

Yes    50388
No     14028
Name: Hobbyist, dtype: int64

In [127]:
# making one column of the dataframe the index

print(df1)
print(df1.index)
df1.set_index('email')
print(df1)   # unchanged: set_index returns a new DataFrame unless inplace=True
print(df1.index)
df1.set_index('email', inplace=True)
print(df1)
print(df1.index)
df1.reset_index(inplace=True)
print(df1)
print(df1.index)

   first     Last                   email
0  Corey  Schafer  CoreySchafer@gmail.com
1   Jane      Doe       JaneDoe@gmail.com
2   John      Doe       JohnDoe@gmail.com
RangeIndex(start=0, stop=3, step=1)
   first     Last                   email
0  Corey  Schafer  CoreySchafer@gmail.com
1   Jane      Doe       JaneDoe@gmail.com
2   John      Doe       JohnDoe@gmail.com
RangeIndex(start=0, stop=3, step=1)
                        first     Last
email                                 
CoreySchafer@gmail.com  Corey  Schafer
JaneDoe@gmail.com        Jane      Doe
JohnDoe@gmail.com        John      Doe
Index(['CoreySchafer@gmail.com', 'JaneDoe@gmail.com', 'JohnDoe@gmail.com'], dtype='object', name='email')
                    email  first     Last
0  CoreySchafer@gmail.com  Corey  Schafer
1       JaneDoe@gmail.com   Jane      Doe
2       JohnDoe@gmail.com   John      Doe
RangeIndex(start=0, stop=3, step=1)
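
An alternative to `inplace=True` is to assign the returned frame to a name. The sketch below uses a made-up two-row frame and also shows the payoff of a meaningful index: rows can be fetched by email with `.loc`.

```python
import pandas as pd

people = pd.DataFrame({
    "first": ["Corey", "Jane"],
    "email": ["corey@example.com", "jane@example.com"],
})

indexed = people.set_index("email")               # new DataFrame; people is unchanged
jane_first = indexed.loc["jane@example.com", "first"]
restored = indexed.reset_index()                  # email becomes a regular column again

print(people.index.tolist(), jane_first, restored.columns.tolist())
```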


In [128]:
# sorting rows based on index
df2=pd.read_csv('./sosurveydataset/survey_results_schema.csv', index_col='Column')

In [129]:
df2

Unnamed: 0_level_0,QuestionText
Column,Unnamed: 1_level_1
Respondent,Randomized respondent ID number (not in order ...
MainBranch,Which of the following options best describes ...
Hobbyist,Do you code as a hobby?
Age,What is your age (in years)? If you prefer not...
Age1stCode,At what age did you write your first line of c...
...,...
WebframeWorkedWith,Which web frameworks have you done extensive d...
WelcomeChange,"Compared to last year, how welcome do you feel..."
WorkWeekHrs,"On average, how many hours per week do you wor..."
YearsCode,"Including any education, how many years have y..."


In [130]:
df2.sort_index()    # this returns a sorted copy; df2 itself is unchanged. For an in-place change, pass inplace=True to sort_index()

Unnamed: 0_level_0,QuestionText
Column,Unnamed: 1_level_1
Age,What is your age (in years)? If you prefer not...
Age1stCode,At what age did you write your first line of c...
CompFreq,"Is that compensation weekly, monthly, or yearly?"
CompTotal,What is your current total compensation (salar...
ConvertedComp,Salary converted to annual USD salaries using ...
...,...
WebframeWorkedWith,Which web frameworks have you done extensive d...
WelcomeChange,"Compared to last year, how welcome do you feel..."
WorkWeekHrs,"On average, how many hours per week do you wor..."
YearsCode,"Including any education, how many years have y..."


In [131]:
df.loc[df['ConvertedComp'] > 70000, ['ConvertedComp', 'Country']]    # the comparison happens on the series; the mask selects rows, the list selects columns

Unnamed: 0,ConvertedComp,Country
7,116000.0,United States
15,108576.0,United Kingdom
16,79000.0,United States
17,1260000.0,United States
18,83400.0,United States
...,...,...
64113,225000.0,United States
64116,150000.0,United States
64127,140000.0,United States
64129,150000.0,United States


In [132]:
countries = ['United States', 'India', 'United Kingdom', 'Germany', 'Canada']
filt = df['Country'].isin(countries)
df.loc[filt]

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,...,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
0,1.0,I am a developer by profession,Yes,,13,Monthly,...,ASP.NET Core,ASP.NET;ASP.NET Core,Just as welcome now as I felt last year,50.0,36,27
1,2.0,I am a developer by profession,No,,19,,...,,,Somewhat more welcome now than last year,,7,4
4,5.0,"I used to be a developer by profession, but no...",Yes,31.0,16,,...,Django;Ruby on Rails,Ruby on Rails,Just as welcome now as I felt last year,,15,8
5,6.0,I am a developer by profession,No,,14,,...,React.js,,,,6,4
6,7.0,I am a developer by profession,Yes,,18,Monthly,...,,,A lot more welcome now than last year,,6,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
64441,62834.0,,Yes,17.0,10,,...,,,Somewhat more welcome now than last year,,7,
64442,62954.0,,Yes,,,,...,,ASP.NET;ASP.NET Core;Django;jQuery;Symfony;Vue.js,,,,
64443,63077.0,,Yes,,20,,...,,,,,4,
64452,64236.0,,Yes,,,,...,,,,,,


In [133]:
filt = df['LanguageWorkedWith'].str.contains('Python', na=False)
df.loc[filt, 'LanguageWorkedWith']

2                                 Objective-C;Python;Swift
7                                               Python;SQL
9                      HTML/CSS;Java;JavaScript;Python;SQL
12                                     C;JavaScript;Python
14        Bash/Shell/PowerShell;C;HTML/CSS;Java;Python;SQL
                               ...                        
64433    Bash/Shell/PowerShell;HTML/CSS;JavaScript;Perl...
64438       C++;HTML/CSS;JavaScript;Python;Ruby;TypeScript
64443              C++;HTML/CSS;Java;JavaScript;Python;SQL
64446    Bash/Shell/PowerShell;C;C#;C++;HTML/CSS;Java;J...
64457    Assembly;Bash/Shell/PowerShell;C;C#;C++;Dart;G...
Name: LanguageWorkedWith, Length: 25287, dtype: object
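
The `na=False` argument matters because missing answers would otherwise leave missing values in the mask. A sketch on an invented three-row series, including the inverted filter:

```python
import pandas as pd

langs = pd.Series(["Python;SQL", "C;Java", None])

uses_python = langs.str.contains("Python", na=False)   # missing entries count as False
no_python = langs[~uses_python]                        # rows without Python, including missing ones

print(uses_python.tolist(), len(no_python))
```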

## Updating column names of DFs

In [134]:
# renaming all columns by assigning to df1.columns (takes effect immediately)
print(df1)
df1.columns = ['Email', 'First Name', 'Last Name'] # this method requires a name for every column, even if only some names change
print(df1.columns)
df1.columns = [x.upper() for x in df1.columns]
print(df1.columns)
df1.columns = df1.columns.str.replace(' ','-')
print(df1.columns)
df1.columns = [x.lower().replace('-','_') for x in df1.columns]

                    email  first     Last
0  CoreySchafer@gmail.com  Corey  Schafer
1       JaneDoe@gmail.com   Jane      Doe
2       JohnDoe@gmail.com   John      Doe
Index(['Email', 'First Name', 'Last Name'], dtype='object')
Index(['EMAIL', 'FIRST NAME', 'LAST NAME'], dtype='object')
Index(['EMAIL', 'FIRST-NAME', 'LAST-NAME'], dtype='object')


In [135]:
df1.columns

Index(['email', 'first_name', 'last_name'], dtype='object')

In [136]:
# renaming specific columns, with the inplace parameter given explicitly
df1.rename(columns={'first_name':'first', 'last_name':'last'}, inplace=True)
df1

Unnamed: 0,email,first,last
0,CoreySchafer@gmail.com,Corey,Schafer
1,JaneDoe@gmail.com,Jane,Doe
2,JohnDoe@gmail.com,John,Doe


## Updating data in DFs

In [139]:
# changing a single row and all of its column values, i.e., the whole row
df1.loc[2] = ['johnsmith@gmail.com', 'John', 'Smith']
df1

Unnamed: 0,email,first,last
0,CoreySchafer@gmail.com,Corey,Schafer
1,JaneDoe@gmail.com,Jane,Doe
2,johnsmith@gmail.com,John,Smith


In [140]:
# changing specific columns for a given row
df1.loc[1, ['email', 'last']] = ['JaneDoodle@email.com', 'Doodle']  # to change only a single value, no list is needed on either side
df1
# one can also use df1.loc[filt, col(s)] = value(s), where filt is a boolean mask

Unnamed: 0,email,first,last
0,CoreySchafer@gmail.com,Corey,Schafer
1,JaneDoodle@email.com,Jane,Doodle
2,johnsmith@gmail.com,John,Smith
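
The mask-based form mentioned in the comment can be sketched as follows, on a made-up frame. Assigning through `.loc` in one step is used here because chained indexing such as `frame[filt]['last'] = ...` may write to a temporary copy.

```python
import pandas as pd

frame = pd.DataFrame({
    "first": ["Jane", "John"],
    "last": ["Doe", "Doe"],
})

filt = frame["first"] == "John"      # boolean mask selecting the rows to update
frame.loc[filt, "last"] = "Smith"    # single value, so no list is needed

print(frame["last"].tolist())
```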


In [143]:
df1['email'] = df1['email'].str.lower()
df1

Unnamed: 0,email,first,last
0,coreyschafer@gmail.com,Corey,Schafer
1,janedoodle@email.com,Jane,Doodle
2,johnsmith@gmail.com,John,Smith


### apply
Calls a function on every value of a Series, or on every column (or row) of a DataFrame. It does NOT work in place; assign the result back to keep it.

In [145]:
df1.email.apply(len)

0    22
1    20
2    19
Name: email, dtype: int64

In [147]:
def update_email(email):
    return email.upper()

df1['email'] = df1.email.apply(update_email)

In [149]:
df1

Unnamed: 0,email,first,last
0,COREYSCHAFER@GMAIL.COM,Corey,Schafer
1,JANEDOODLE@EMAIL.COM,Jane,Doodle
2,JOHNSMITH@GMAIL.COM,John,Smith


In [150]:
df1['email'] = df1.email.apply(lambda x : x.lower())
print(df1)

                    email  first     last
0  coreyschafer@gmail.com  Corey  Schafer
1    janedoodle@email.com   Jane   Doodle
2     johnsmith@gmail.com   John    Smith


In [152]:
# apply on dataframes
print(df1.apply(len))   # default axis=0 ('index'): the function receives each column as a Series
print(df1.apply(len, axis = 'columns'))   # the function receives each row as a Series

email    3
first    3
last     3
dtype: int64
0    3
1    3
2    3
dtype: int64
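
Because `df1` is square (3 x 3), both calls above print 3 and the axis is hard to see. A numeric frame with different values per column makes the direction clearer; the `scores` frame is invented for this sketch.

```python
import pandas as pd

scores = pd.DataFrame({"math": [1, 2, 3], "art": [10, 20, 30]})

per_column = scores.apply(sum)                   # axis=0: one result per column
per_row = scores.apply(sum, axis="columns")      # one result per row

print(per_column.tolist(), per_row.tolist())
```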


In [153]:
# finding minimum value from each series of the dataframe
df1.apply(pd.Series.min)

email    coreyschafer@gmail.com
first                     Corey
last                     Doodle
dtype: object

In [154]:
# a lambda passed to DataFrame.apply also receives each column as a Series, so pd.Series does not need to be named explicitly
df1.apply(lambda x : x.min())   # x is itself a Series, and we find the minimum of each one. This is useful for numerical analysis.

email    coreyschafer@gmail.com
first                     Corey
last                     Doodle
dtype: object

### applymap
It works only on DataFrames, not on Series. It runs a function on every element of the DataFrame. It does NOT work in place. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)

In [159]:
print(df1.applymap(len))
print(df1.applymap(str.lower))
df1

   email  first  last
0     22      5     7
1     20      4     6
2     19      4     5
                    email  first     last
0  coreyschafer@gmail.com  corey  schafer
1    janedoodle@email.com   jane   doodle
2     johnsmith@gmail.com   john    smith


Unnamed: 0,email,first,last
0,coreyschafer@gmail.com,Corey,Schafer
1,janedoodle@email.com,Jane,Doodle
2,johnsmith@gmail.com,John,Smith
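
On pandas 2.1 and later, `applymap` emits a deprecation warning and `DataFrame.map` does the same job. A version-tolerant sketch, assuming only that one of the two methods exists, could look like this (the frame is made up):

```python
import pandas as pd

frame = pd.DataFrame({"first": ["Corey", "Jane"], "last": ["Schafer", "Doe"]})

# Prefer DataFrame.map where available, fall back to applymap on older pandas.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
lengths = elementwise(len)

print(lengths.values.tolist())
```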


### map
Series.map substitutes each value in the Series with another value. NOT inplace.

In [160]:
df1['first'].map({'Corey':'kirrr', 'Jane':'girrrh'})    # values that are not keys of the dictionary are turned into NaN.

0     kirrr
1    girrrh
2       NaN
Name: first, dtype: object

### replace
To avoid the NaN values that map produces for unmapped values, use replace. Here, only the specified values are altered. NOT inplace.

In [161]:
df1['first'].replace({'Corey':'kirrr', 'Jane':'girrrh'})

0     kirrr
1    girrrh
2      John
Name: first, dtype: object
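
The difference between the two can be seen on a standalone series. The sketch also shows one possible way to keep unmapped values when `map` is used, by filling the gaps from the original series; the variable names are illustrative.

```python
import pandas as pd

first = pd.Series(["Corey", "Jane", "John"])
nicknames = {"Corey": "kirrr", "Jane": "girrrh"}

mapped = first.map(nicknames)                       # "John" has no key -> NaN
mapped_keep = first.map(nicknames).fillna(first)    # fall back to the original value
replaced = first.replace(nicknames)                 # only listed values change

print(mapped_keep.tolist(), replaced.tolist())
```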

In [173]:
# add columns
df1['full_name'] = df1['first'] + ' ' + df1['last']
df1

Unnamed: 0,email,first,last,full_name
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer
1,janedoodle@email.com,Jane,Doodle,Jane Doodle
2,johnsmith@gmail.com,John,Smith,John Smith


In [174]:
# removing cols
df1.drop(columns='full_name', inplace=True)   # or, columns=['col1', 'col2', ...]
df1

Unnamed: 0,email,first,last
0,coreyschafer@gmail.com,Corey,Schafer
1,janedoodle@email.com,Jane,Doodle
2,johnsmith@gmail.com,John,Smith


In [175]:
df1['full_name'] = df1['first'] + ' ' + df1['last']

# split column

df1[['firstname', 'lastname']] = df1['full_name'].str.split(' ',expand = True)
df1

Unnamed: 0,email,first,last,full_name,firstname,lastname
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer
1,janedoodle@email.com,Jane,Doodle,Jane Doodle,Jane,Doodle
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith


### removing or adding rows of data

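append doesn't have an inplace parameter. Hence, for a permanent change, assign the result back: df1 = df1.append(...). Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is used instead.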

In [180]:
# adding a row of data. This is NOT an in-place change: both calls below return new DataFrames and leave df1 untouched
df1.append({'firstname':'sharan'}, ignore_index=True)
df1.append({'firstname':['wubba', 'lubba', 'dub', 'dubb']}, ignore_index=True)

Unnamed: 0,email,first,last,full_name,firstname,lastname
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer
1,janedoodle@email.com,Jane,Doodle,Jane Doodle,Jane,Doodle
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith
3,,,,,"[wubba, lubba, dub, dubb]",


In [182]:
ppl = {
    'first' : ['Tony', 'Steve'],
    'last' : ['Stark', 'Rogers'],
    'email' : ['notyourkakkar@avengers.com', 'itsyourcap@avengers.com']
}

In [184]:
ndf = pd.DataFrame(ppl)

In [185]:
df1 = df1.append(ndf, ignore_index=True)

In [186]:
df1

Unnamed: 0,email,first,last,full_name,firstname,lastname
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer
1,janedoodle@email.com,Jane,Doodle,Jane Doodle,Jane,Doodle
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith
3,notyourkakkar@avengers.com,Tony,Stark,,,
4,itsyourcap@avengers.com,Steve,Rogers,,,
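
On pandas 2.0 and later, where `DataFrame.append` no longer exists, the same row additions can be written with `pd.concat`. A minimal sketch with made-up frames:

```python
import pandas as pd

base = pd.DataFrame({"first": ["Corey"], "last": ["Schafer"]})
extra = pd.DataFrame({"first": ["Tony", "Steve"], "last": ["Stark", "Rogers"]})

# Stack two frames; ignore_index renumbers the rows 0..n-1.
combined = pd.concat([base, extra], ignore_index=True)

# A single row given as a dict is wrapped in a one-row DataFrame first.
one_row = pd.DataFrame([{"first": "sharan"}])
with_row = pd.concat([combined, one_row], ignore_index=True)

print(combined.index.tolist(), len(with_row))
```

Columns missing from the added row (here `last`) are filled with NaN, as with `append`.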


In [188]:
# removing rows
df1.drop(index=4)   # NOT INPLACE

Unnamed: 0,email,first,last,full_name,firstname,lastname
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer
1,janedoodle@email.com,Jane,Doodle,Jane Doodle,Jane,Doodle
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith
3,notyourkakkar@avengers.com,Tony,Stark,,,


In [193]:
filt = df1['last'] == 'Doodle'
print(filt)
df1.drop(index=df1[filt].index)

0    False
1     True
2    False
3    False
4    False
Name: last, dtype: bool


Unnamed: 0,email,first,last,full_name,firstname,lastname
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith
3,notyourkakkar@avengers.com,Tony,Stark,,,
4,itsyourcap@avengers.com,Steve,Rogers,,,
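
Dropping the rows that match a filter gives the same result as keeping the rows that do not match it, which is often shorter to write. A sketch on an invented frame:

```python
import pandas as pd

frame = pd.DataFrame({
    "first": ["Jane", "John", "Tony"],
    "last": ["Doodle", "Smith", "Stark"],
})

filt = frame["last"] == "Doodle"
dropped = frame.drop(index=frame[filt].index)   # drop the matching rows
kept = frame[~filt]                             # or keep the non-matching rows

print(kept.index.tolist(), dropped.equals(kept))
```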


### Sorting data in pandas

Pass inplace=True as a parameter to make the sorted order permanent.

In [196]:
df1

Unnamed: 0,email,first,last,full_name,firstname,lastname
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer
1,janedoodle@email.com,Jane,Doodle,Jane Doodle,Jane,Doodle
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith
3,notyourkakkar@avengers.com,Tony,Stark,,,
4,itsyourcap@avengers.com,Steve,Rogers,,,


In [197]:
df1.sort_values(by = 'last')

Unnamed: 0,email,first,last,full_name,firstname,lastname
1,janedoodle@email.com,Jane,Doodle,Jane Doodle,Jane,Doodle
4,itsyourcap@avengers.com,Steve,Rogers,,,
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith
3,notyourkakkar@avengers.com,Tony,Stark,,,


In [198]:
df1.sort_values(by = ['last', 'first'], ascending = False)

Unnamed: 0,email,first,last,full_name,firstname,lastname
3,notyourkakkar@avengers.com,Tony,Stark,,,
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer
4,itsyourcap@avengers.com,Steve,Rogers,,,
1,janedoodle@email.com,Jane,Doodle,Jane Doodle,Jane,Doodle


In [200]:
df1.sort_values(by = ['first', 'last'], ascending = [False, True])

Unnamed: 0,email,first,last,full_name,firstname,lastname
3,notyourkakkar@avengers.com,Tony,Stark,,,
4,itsyourcap@avengers.com,Steve,Rogers,,,
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith
1,janedoodle@email.com,Jane,Doodle,Jane Doodle,Jane,Doodle
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer


In [201]:
# sorting by index. It restores the DF to the original row order, as long as the index is the default 0 to n-1
df1.sort_index()

Unnamed: 0,email,first,last,full_name,firstname,lastname
0,coreyschafer@gmail.com,Corey,Schafer,Corey Schafer,Corey,Schafer
1,janedoodle@email.com,Jane,Doodle,Jane Doodle,Jane,Doodle
2,johnsmith@gmail.com,John,Smith,John Smith,John,Smith
3,notyourkakkar@avengers.com,Tony,Stark,,,
4,itsyourcap@avengers.com,Steve,Rogers,,,


In [202]:
# viewing just one sorted series of a dataframe
df1['email'].sort_values()

0        coreyschafer@gmail.com
4       itsyourcap@avengers.com
1          janedoodle@email.com
2           johnsmith@gmail.com
3    notyourkakkar@avengers.com
Name: email, dtype: object

In [203]:
# displaying the N largest values...
# M1: df.sort_values(by=col, ascending=False)[[col1, col2,....]].head(n)
# M2: below
df['ConvertedComp'].nlargest(10)    # there is also nsmallest

121     2000000.0
123     2000000.0
191     2000000.0
663     2000000.0
697     2000000.0
722     2000000.0
816     2000000.0
982     2000000.0
1018    2000000.0
1032    2000000.0
Name: ConvertedComp, dtype: float64
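
The two methods listed in the comments can be compared on a small series. The values below are made up; the missing value is included to show that neither method returns it among the largest values.

```python
import pandas as pd

comp = pd.Series([50.0, 200.0, None, 120.0, 75.0])

top2 = comp.nlargest(2)                                # method 2
same = comp.sort_values(ascending=False).head(2)       # method 1: sort, then take the head
bottom2 = comp.nsmallest(2)

print(top2.tolist(), bottom2.tolist(), top2.equals(same))
```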

In [204]:
df.nlargest(10, 'ConvertedComp')    # also there is nsmallest

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,...,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
121,123.0,I am a developer by profession,Yes,26.0,12,Weekly,...,Flask;jQuery;React.js,Spring,Just as welcome now as I felt last year,36.0,8,3
123,125.0,"I am not primarily a developer, but I write co...",Yes,41.0,30,Monthly,...,,,Just as welcome now as I felt last year,40.0,11,11
191,193.0,I am a developer by profession,Yes,29.0,16,Weekly,...,,,Just as welcome now as I felt last year,40.0,13,7
663,665.0,I am a developer by profession,Yes,24.0,13,Weekly,...,React.js,Express;React.js;Ruby on Rails,Just as welcome now as I felt last year,40.0,4,Less than 1 year
697,699.0,"I am not primarily a developer, but I write co...",Yes,39.0,16,Weekly,...,Angular;ASP.NET;ASP.NET Core;Express;Flask,,,40.0,5,2
722,724.0,"I am not primarily a developer, but I write co...",Yes,,12,Weekly,...,,React.js,Not applicable - I did not use Stack Overflow ...,40.0,3,3
816,818.0,"I am not primarily a developer, but I write co...",Yes,40.0,15,Weekly,...,Angular;Flask;Spring,,Somewhat less welcome now than last year,40.0,25,2
982,986.0,I am a developer by profession,Yes,27.0,11,Weekly,...,Express;Laravel,Express;jQuery;Laravel,Just as welcome now as I felt last year,40.0,16,8
1018,1022.0,I am a developer by profession,Yes,34.0,16,Weekly,...,React.js,Angular;Angular.js;React.js;Vue.js,Just as welcome now as I felt last year,40.0,18,13
1032,1036.0,I am a developer by profession,Yes,26.0,11,Weekly,...,ASP.NET Core;Vue.js,Angular;ASP.NET;ASP.NET Core;Django;Express;Fl...,A lot more welcome now than last year,40.0,5,3
