# **4. DATA WRANGLING**

*Data wrangling* is the process of transforming or preparing datasets to ensure they are actionable for a set of analysis tasks.

Goals of data wrangling:
- to make data usable, so that analysis tools can parse and manipulate them.

- to ensure that data are responsive to the intended analyses, *i.e.,* that they contain the necessary information at an acceptable level of description and correctness to support successful modeling and decision-making.

Import libraries:

In [1]:
import pandas as pd

Import data:

In [2]:
USA = pd.read_csv("https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/TECHUSA.csv")
USA

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,,web,Tempe,AZ,1-May-07,6850000,USD,b
1,lifelock,LifeLock,,web,Tempe,AZ,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,,web,Tempe,AZ,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,,web,Seattle,WA,30-Oct-07,9500000,USD,a


In [3]:
USA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   permalink       1460 non-null   object 
 1   company         1460 non-null   object 
 2   numEmps         567 non-null    float64
 3   category        1436 non-null   object 
 4   city            1442 non-null   object 
 5   state           1460 non-null   object 
 6   fundedDate      1460 non-null   object 
 7   raisedAmt       1460 non-null   int64  
 8   raisedCurrency  1460 non-null   object 
 9   round           1460 non-null   object 
dtypes: float64(1), int64(1), object(8)
memory usage: 114.2+ KB


1. Column Renaming:

In [4]:
#Change the name of Column "category" to "cat"
USA.rename(columns={'category' :'cat'})

Unnamed: 0,permalink,company,numEmps,cat,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,,web,Tempe,AZ,1-May-07,6850000,USD,b
1,lifelock,LifeLock,,web,Tempe,AZ,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,,web,Tempe,AZ,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,,web,Seattle,WA,30-Oct-07,9500000,USD,a


In [None]:
#Change the name of multiple columns: "category" to "cat" and "fundedDate" to "FD"
USA.rename(columns={"category": "cat", "fundedDate": "FD"})
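
Note that `rename` returns a new DataFrame and leaves the original untouched unless the result is assigned back. The following is a minimal sketch on a toy frame (the `demo` values are made up for illustration and only stand in for the USA dataset):

```python
import pandas as pd

# Toy frame with made-up values, standing in for the USA dataset
demo = pd.DataFrame({
    "category": ["web", "software"],
    "fundedDate": ["1-May-07", "1-Oct-06"],
})

# rename returns a renamed copy; demo keeps its original column names
renamed = demo.rename(columns={"category": "cat", "fundedDate": "FD"})

print(list(demo.columns))     # original names are untouched
print(list(renamed.columns))  # new names appear only in the returned frame
```

To make the change permanent, assign the result, for example `demo = demo.rename(...)`.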

2. Sorting:

In [6]:
USA.sort_values('company', ascending=True)

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
435,23andme,23andMe,30.0,web,Mountain View,CA,1-May-07,9000000,USD,a
417,3jam,3Jam,,web,Menlo Park,CA,1-Jul-07,4000000,USD,a
557,4homemedia,4HomeMedia,10.0,web,Sunnyvale,CA,1-Jan-07,2850000,USD,a
1183,5min,5min,8.0,web,New York,NY,1-Apr-07,300000,USD,angel
1184,5min,5min,8.0,web,New York,NY,1-Nov-07,5000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
504,uber,uber,24.0,web,Beverly Hills,CA,26-May-08,7600000,USD,b
1071,utoopia,utoopia,2.0,web,Boston,MA,1-Mar-07,100000,USD,seed
1266,vbs-tv,vbs tv,40.0,other,Brooklyn,NY,1-Dec-06,10000000,USD,seed
1260,x-1,x+1,,web,New York,NY,1-Jun-08,16000000,USD,a
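
In the output above, lowercase names such as "uber" end up after all capitalized ones, because strings are compared character by character and uppercase letters come first. The sketch below, on a small made-up frame, shows one possible way to sort case-insensitively with the `key` argument, and how to sort by several columns at once:

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame({
    "company": ["uber", "Zillow", "5min", "5min"],
    "raisedAmt": [7600000, 1000000, 300000, 5000000],
})

# Sort by company (ascending), then by raisedAmt (largest first) within each company
by_two = demo.sort_values(["company", "raisedAmt"], ascending=[True, False])

# Case-insensitive ordering: key receives the column as a Series
ci = demo.sort_values("company", key=lambda s: s.str.lower())

print(by_two)
print(ci)
```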


3. Row Filtering:

In [7]:
#Filter the data considering a specific selection predicate
USA_filtered = USA[USA['city'] == "Mountain View"]
USA_filtered

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
34,plaxo,Plaxo,50.0,web,Mountain View,CA,1-Nov-02,3800000,USD,a
35,plaxo,Plaxo,50.0,web,Mountain View,CA,1-Jul-03,8500000,USD,b
36,plaxo,Plaxo,50.0,web,Mountain View,CA,1-Apr-04,7000000,USD,c
37,plaxo,Plaxo,50.0,web,Mountain View,CA,1-Feb-07,9000000,USD,d
68,google,Google,20000.0,web,Mountain View,CA,1-Jun-99,25000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
874,plastic-logic,Plastic Logic,,hardware,Mountain View,CA,20-Apr-02,13700000,USD,a
875,plastic-logic,Plastic Logic,,hardware,Mountain View,CA,5-Jan-05,8000000,USD,b
876,plastic-logic,Plastic Logic,,hardware,Mountain View,CA,30-Nov-05,24000000,USD,c
877,plastic-logic,Plastic Logic,,hardware,Mountain View,CA,6-Jan-07,100000000,USD,d
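
Selection predicates can also be combined. The following sketch uses a small made-up frame (not the real dataset) to illustrate `&` for a conjunction and `isin` for membership in a list of values:

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "company": ["Plaxo", "Google", "LifeLock", "Trusera"],
    "city": ["Mountain View", "Mountain View", "Tempe", "Seattle"],
    "raisedAmt": [3800000, 25000000, 6850000, 2000000],
})

# Combine predicates with & (and) or | (or); each predicate needs its own parentheses
big_mv = demo[(demo["city"] == "Mountain View") & (demo["raisedAmt"] > 5000000)]

# isin tests membership in a list of values
two_cities = demo[demo["city"].isin(["Tempe", "Seattle"])]

print(big_mv)
print(two_cities)
```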


4. Dropping:

In [8]:
#Drop rows considering specific filtered data
USA_final = USA.drop(USA[USA['city'] == 'Tempe'].index)
USA_final

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
5,infusionsoft,Infusionsoft,105.0,software,Gilbert,AZ,1-Oct-07,9000000,USD,a
6,gauto,gAuto,4.0,web,Scottsdale,AZ,1-Jan-08,250000,USD,seed
7,chosenlist-com,ChosenList.com,5.0,web,Scottsdale,AZ,1-Oct-06,140000,USD,seed
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,,web,Seattle,WA,30-Oct-07,9500000,USD,a
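
Dropping the rows that match a predicate is equivalent to keeping the rows that do not match it. A minimal sketch on made-up data, assuming a default unique index:

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "company": ["LifeLock", "MyCityFaces", "Flypaper"],
    "city": ["Tempe", "Scottsdale", "Phoenix"],
})

# Approach used above: look up the index labels of the matching rows and drop them
dropped = demo.drop(demo[demo["city"] == "Tempe"].index)

# Equivalent: keep only the rows where the predicate is False
kept = demo[demo["city"] != "Tempe"]

print(dropped.equals(kept))
```

The second form is usually shorter; `drop` is convenient when the index labels to remove are already known.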


5. Standardization:

  *Standardization* consists of reorganizing composite fields, checking data types, and replacing alternative spellings with a single one. Its purpose is to make comparisons easier.

In [9]:
#Change a value (in this case we want to avoid abbreviations)
USA.loc[USA['state'] == 'AZ', 'state'] = 'Arizona'
USA

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,,web,Tempe,Arizona,1-May-07,6850000,USD,b
1,lifelock,LifeLock,,web,Tempe,Arizona,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,,web,Tempe,Arizona,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,Arizona,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,,web,Phoenix,Arizona,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,,web,Seattle,WA,30-Oct-07,9500000,USD,a
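
When several values need a standard spelling, a lookup dictionary avoids one `loc` assignment per value. The sketch below also illustrates a data type check by parsing date strings shaped like those in `fundedDate`. The `state_names` dictionary is a hypothetical, incomplete lookup table, and the date format string is an assumption based on the values shown above:

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "state": ["AZ", "WA", "CA"],
    "fundedDate": ["1-May-07", "30-Oct-07", "1-Jun-99"],
})

# Hypothetical lookup table; extend it to cover every abbreviation in the data
state_names = {"AZ": "Arizona", "WA": "Washington"}

# replace leaves values without a mapping (here "CA") unchanged
demo["state"] = demo["state"].replace(state_names)

# Data type check: parse day-month-year strings into datetime values
demo["fundedDate"] = pd.to_datetime(demo["fundedDate"], format="%d-%b-%y")

print(demo)
print(demo.dtypes)
```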


6. Column Splitting:

In [10]:
NBA = pd.read_csv("https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/NBA.csv")
NBA

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
442,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0
443,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
444,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
445,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0


In [11]:
#Split a column into two
NBA_split_1 = NBA.Name.str.split(expand=True)
NBA_split_1

Unnamed: 0,0,1
0,Avery,Bradley
1,Jae,Crowder
2,John,Holland
3,R.J.,Hunter
4,Jonas,Jerebko
...,...,...
442,Trey,Lyles
443,Shelvin,Mack
444,Raul,Neto
445,Tibor,Pleiss


In [12]:
#Add two columns generated from the split function
NBA[['First', 'Last']] = NBA.Name.str.split(" ", n=1, expand=True)
NBA

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,First,Last
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Avery,Bradley
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Jae,Crowder
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,John,Holland
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,R.J.,Hunter
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,Jonas,Jerebko
...,...,...,...,...,...,...,...,...,...,...,...
442,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0,Trey,Lyles
443,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0,Shelvin,Mack
444,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0,Raul,Neto
445,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0,Tibor,Pleiss


In [13]:
#Drop the original column
NBA = NBA.drop(["Name"], axis = 1)
NBA

Unnamed: 0,Team,Number,Position,Age,Height,Weight,College,Salary,First,Last
0,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Avery,Bradley
1,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Jae,Crowder
2,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,John,Holland
3,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,R.J.,Hunter
4,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,Jonas,Jerebko
...,...,...,...,...,...,...,...,...,...,...
442,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0,Trey,Lyles
443,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0,Shelvin,Mack
444,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0,Raul,Neto
445,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0,Tibor,Pleiss
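
The `n` argument matters when a value contains more separators than expected. The sketch below uses a hypothetical three-word name (not taken from the NBA dataset) to compare an unlimited split, a split limited to one cut from the left, and `rsplit`, which cuts from the right:

```python
import pandas as pd

# "Ana Maria Silva" is a hypothetical name used only to show the three-token case
names = pd.Series(["Avery Bradley", "Ana Maria Silva"])

# No limit: one column per token, shorter rows are padded with missing values
full = names.str.split(expand=True)

# n=1: at most one split from the left -> first token + remainder
first_rest = names.str.split(" ", n=1, expand=True)

# rsplit with n=1: at most one split from the right -> remainder + last token
rest_last = names.str.rsplit(" ", n=1, expand=True)

print(full)
print(first_rest)
print(rest_last)
```

With `n=1` the result always has two columns, so assigning it to `[['First', 'Last']]` does not fail on longer names.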


7. Column Merging:

In [14]:
#By default str.cat joins the values with no separator; to insert one we have to pass it explicitly with sep
NBA["Complete Name"] = NBA['Last'].str.cat(NBA[['First']], sep=',')
NBA

Unnamed: 0,Team,Number,Position,Age,Height,Weight,College,Salary,First,Last,Complete Name
0,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Avery,Bradley,"Bradley,Avery"
1,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Jae,Crowder,"Crowder,Jae"
2,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,John,Holland,"Holland,John"
3,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,R.J.,Hunter,"Hunter,R.J."
4,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,Jonas,Jerebko,"Jerebko,Jonas"
...,...,...,...,...,...,...,...,...,...,...,...
442,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0,Trey,Lyles,"Lyles,Trey"
443,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0,Shelvin,Mack,"Mack,Shelvin"
444,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0,Raul,Neto,"Neto,Raul"
445,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0,Tibor,Pleiss,"Pleiss,Tibor"
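
The effect of `sep`, and of the `na_rep` argument for missing values, can be seen on a tiny made-up frame. This is only a sketch; the missing first name is introduced on purpose:

```python
import pandas as pd

# Made-up rows; the third first name is deliberately missing
demo = pd.DataFrame({
    "First": ["Avery", "Jae", None],
    "Last": ["Bradley", "Crowder", "Holland"],
})

# Without sep the strings are simply concatenated
no_sep = demo["Last"].str.cat(demo["First"])

# With sep the separator is inserted between the two values
with_sep = demo["Last"].str.cat(demo["First"], sep=", ")

# A missing value on either side makes the result missing, unless na_rep is given
filled = demo["Last"].str.cat(demo["First"], sep=", ", na_rep="?")

print(no_sep)
print(with_sep)
print(filled)
```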


In [15]:
#Create a new Dataframe of addresses
ADDR = pd.DataFrame({
    'Address': ['4860 Sunset Boulevard,San Francisco,California',
                '3055 Paradise Lane,Salt Lake City,Utah',
                '682 Main Street,Detroit,Michigan',
                '9001 Cascade Road,Kansas City,Missouri']
})
ADDR

Unnamed: 0,Address
0,"4860 Sunset Boulevard,San Francisco,California"
1,"3055 Paradise Lane,Salt Lake City,Utah"
2,"682 Main Street,Detroit,Michigan"
3,"9001 Cascade Road,Kansas City,Missouri"


In [16]:
#We can split into more than 2 columns
#n limits the number of splits applied to each value (n=2 yields at most 3 parts)
ADDR['Address'].str.split(',',n=2, expand=True)

Unnamed: 0,0,1,2
0,4860 Sunset Boulevard,San Francisco,California
1,3055 Paradise Lane,Salt Lake City,Utah
2,682 Main Street,Detroit,Michigan
3,9001 Cascade Road,Kansas City,Missouri


In [17]:
#Split a column and add new columns
ADDR[['Street', 'City', 'State']] = ADDR['Address'].str.split(',', expand=True)
ADDR

Unnamed: 0,Address,Street,City,State
0,"4860 Sunset Boulevard,San Francisco,California",4860 Sunset Boulevard,San Francisco,California
1,"3055 Paradise Lane,Salt Lake City,Utah",3055 Paradise Lane,Salt Lake City,Utah
2,"682 Main Street,Detroit,Michigan",682 Main Street,Detroit,Michigan
3,"9001 Cascade Road,Kansas City,Missouri",9001 Cascade Road,Kansas City,Missouri


In [18]:
#Drop the original column
ADDR.drop(['Address'], axis=1)

Unnamed: 0,Street,City,State
0,4860 Sunset Boulevard,San Francisco,California
1,3055 Paradise Lane,Salt Lake City,Utah
2,682 Main Street,Detroit,Michigan
3,9001 Cascade Road,Kansas City,Missouri


8. Dealing with Missing Data:

  - drop missing values with ***dropna***
  (axis --> "0": drops the rows that contain missing values;
  "1": drops the columns that contain missing values)

  - fill missing values with ***fillna***
  (method --> "ffill": propagate the last valid observation forward to the next valid one;
  "bfill": use the next valid observation to fill the gap; axis (along which to fill missing values) --> "0": down each column; "1": across each row)

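The options listed above can be tried on a tiny frame before applying them to the full dataset. The following sketch uses made-up values; only the column names mirror the USA data:

```python
import pandas as pd

nan = float("nan")

# Made-up rows: numEmps is missing for companies A and C
demo = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "numEmps": [nan, 7.0, nan, 15.0],
})

rows_dropped = demo.dropna(axis=0)  # removes rows 0 and 2
cols_dropped = demo.dropna(axis=1)  # removes the numEmps column

forward = demo["numEmps"].ffill()   # row 0 stays missing: nothing precedes it
backward = demo["numEmps"].bfill()  # each gap takes the next valid value

print(rows_dropped)
print(cols_dropped)
print(forward.tolist())
print(backward.tolist())
```
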

In [19]:
USA = pd.read_csv("https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/TECHUSA.csv")
USA

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,,web,Tempe,AZ,1-May-07,6850000,USD,b
1,lifelock,LifeLock,,web,Tempe,AZ,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,,web,Tempe,AZ,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,,web,Seattle,WA,30-Oct-07,9500000,USD,a


In [20]:
USA_drop_rows = USA.dropna(axis=0)
USA_drop_rows

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
5,infusionsoft,Infusionsoft,105.0,software,Gilbert,AZ,1-Oct-07,9000000,USD,a
6,gauto,gAuto,4.0,web,Scottsdale,AZ,1-Jan-08,250000,USD,seed
7,chosenlist-com,ChosenList.com,5.0,web,Scottsdale,AZ,1-Oct-06,140000,USD,seed
8,chosenlist-com,ChosenList.com,5.0,web,Scottsdale,AZ,25-Jan-08,233750,USD,angel
...,...,...,...,...,...,...,...,...,...,...
1448,teachstreet,TeachStreet,8.0,web,Seattle,WA,1-Mar-08,2250000,USD,a
1453,cozi,Cozi,26.0,software,Seattle,WA,11-Jul-07,3000000,USD,a
1454,cozi,Cozi,26.0,software,Seattle,WA,1-Jun-08,8000000,USD,c
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel


In [21]:
USA_drop_cols = USA.dropna(axis=1)
USA_drop_cols

Unnamed: 0,permalink,company,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,AZ,1-May-07,6850000,USD,b
1,lifelock,LifeLock,AZ,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,AZ,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,AZ,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...
1455,trusera,Trusera,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,WA,30-Oct-07,9500000,USD,a


In [22]:
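#Forward fill: propagate the last valid observation forward (axis=0: down each column)
#fillna(method='ffill') is deprecated since pandas 2.1; ffill() is its replacement
USA.ffill(axis=0)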


Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,,web,Tempe,AZ,1-May-07,6850000,USD,b
1,lifelock,LifeLock,,web,Tempe,AZ,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,,web,Tempe,AZ,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,7.0,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,15.0,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,75.0,web,Seattle,WA,30-Oct-07,9500000,USD,a


In [23]:
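#Backward fill: use the next valid observation to fill the gap (axis=0: down each column)
#fillna(method='bfill') is deprecated since pandas 2.1; bfill() is its replacement
USA.bfill(axis=0)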


Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,7.0,web,Tempe,AZ,1-May-07,6850000,USD,b
1,lifelock,LifeLock,7.0,web,Tempe,AZ,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,7.0,web,Tempe,AZ,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,105.0,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,75.0,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,,web,Seattle,WA,30-Oct-07,9500000,USD,a


In [24]:
USA.fillna(0)

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,0.0,web,Tempe,AZ,1-May-07,6850000,USD,b
1,lifelock,LifeLock,0.0,web,Tempe,AZ,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,0.0,web,Tempe,AZ,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,0.0,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,0.0,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,0.0,web,Seattle,WA,30-Oct-07,9500000,USD,a


In [25]:
USA.fillna("missing")

Unnamed: 0,permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
0,lifelock,LifeLock,missing,web,Tempe,AZ,1-May-07,6850000,USD,b
1,lifelock,LifeLock,missing,web,Tempe,AZ,1-Oct-06,6000000,USD,a
2,lifelock,LifeLock,missing,web,Tempe,AZ,1-Jan-08,25000000,USD,c
3,mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
4,flypaper,Flypaper,missing,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
...,...,...,...,...,...,...,...,...,...,...
1455,trusera,Trusera,15.0,web,Seattle,WA,1-Jun-07,2000000,USD,angel
1456,alerts-com,Alerts.com,missing,web,Bellevue,WA,8-Jul-08,1200000,USD,a
1457,myrio,Myrio,75.0,software,Bothell,WA,1-Jan-01,20500000,USD,unattributed
1458,grid-networks,Grid Networks,missing,web,Seattle,WA,30-Oct-07,9500000,USD,a
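
A single fill value applies to every column, which is why `"missing"` ends up mixed with numbers in `numEmps` above. One possible alternative, sketched here on made-up data, is to pass a dictionary with a different fill value per column. Using the median for `numEmps` is an illustrative choice, not a recommendation from the original material:

```python
import pandas as pd

# Made-up rows with one missing value per column
demo = pd.DataFrame({
    "numEmps": [float("nan"), 7.0, 105.0],
    "city": ["Tempe", None, "Seattle"],
})

# One fill value per column: a statistic for the numeric one, a label for the text one
filled = demo.fillna({
    "numEmps": demo["numEmps"].median(),
    "city": "unknown",
})

print(filled)
print(filled.dtypes)
```

Filling this way keeps `numEmps` numeric, so it can still be used in computations.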


9. Formatting values:

In [26]:
#methods to format values
df = pd.DataFrame(
    data={
        "Currency": {0: 111.23, 1: 321.23}
    }
)
format_mapping = {"Currency": "{:,.2f}"}
df.style.format(format_mapping)

Unnamed: 0,Currency
0,111.23
1,321.23


In [27]:
df = pd.DataFrame(
    data={
        "Int": {0: 23, 1: 3}
    }
)
format_mapping = {"Int": "{:,.0f}"}
df.style.format(format_mapping)

Unnamed: 0,Int
0,23
1,3


In [28]:
df = pd.DataFrame(
    data={
        "Rate": {0: 0.03030, 1: 0.09840},
    }
)
format_mapping = {"Rate": "{:.2f}%"}
df.style.format(format_mapping)

Unnamed: 0,Rate
0,0.03%
1,0.10%
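
The strings passed to `style.format` are ordinary Python format specifications, so they can be checked without a DataFrame style. The sketch below contrasts `"{:.2f}%"`, which only appends a percent sign to the raw value, with `"{:.2%}"`, which multiplies by 100 first. It assumes the rates are stored as fractions, as in the example above:

```python
import pandas as pd

rates = pd.Series([0.03030, 0.09840])

# "{:.2f}%" rounds to two decimals and appends a percent sign to the raw value
appended = rates.map("{:.2f}%".format)

# "{:.2%}" multiplies by 100 before adding the sign
scaled = rates.map("{:.2%}".format)

# "," inserts a thousands separator
amount = "{:,.2f}".format(6850000)

print(appended.tolist())  # ['0.03%', '0.10%']
print(scaled.tolist())    # ['3.03%', '9.84%']
print(amount)             # 6,850,000.00
```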


10. Set column names:

In [29]:
df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
df

Unnamed: 0,0,1,2
0,1,2,3
1,foo,bar,baz
2,4,5,6


In [30]:
df.columns = df.iloc[1]
df

1,foo,bar,baz
0,1,2,3
1,foo,bar,baz
2,4,5,6


In [31]:
df.drop(df.index[1])

1,foo,bar,baz
0,1,2,3
2,4,5,6
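
The steps above can be combined into one small routine. In the outputs above, the label `1` in the header corner is the name of the row that was promoted; assigning a plain list instead avoids carrying it over. This is a sketch of one possible cleanup, in which the index reset and the `infer_objects` call are additions that the original cells do not perform:

```python
import pandas as pd

raw = pd.DataFrame([(1, 2, 3), ("foo", "bar", "baz"), (4, 5, 6)])

# Take the header row as a plain list, so the row label is not kept as the columns' name
header = raw.iloc[1].tolist()

# Remove the header row from the data and renumber the remaining rows
clean = raw.drop(raw.index[1]).reset_index(drop=True)
clean.columns = header

# The columns were object dtype because they mixed strings and numbers;
# infer_objects re-detects a better dtype now that the strings are gone
clean = clean.infer_objects()

print(clean)
print(clean.dtypes)
```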


11. Transpose:

In [32]:
df.transpose()

1,0,1,2
foo,1,foo,4
bar,2,bar,5
baz,3,baz,6
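
Transposing swaps the row labels with the column labels. When the original columns have different types, each transposed column mixes them and falls back to the generic `object` dtype, as this sketch on made-up data illustrates:

```python
import pandas as pd

# Made-up rows: one text column and one integer column
demo = pd.DataFrame({
    "company": ["Plaxo", "Google"],
    "numEmps": [50, 20000],
})

# Former column names become the index; former row labels become the columns
t = demo.T

print(t)
print(t.dtypes)  # every transposed column mixes strings and integers
```

For this reason, transposing back with `t.T` restores the shape but not necessarily the original numeric dtypes.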


**Summary:**

*Data Wrangling operations:*
- DataFrame.info()
- DataFrame.rename()
- DataFrame.sort_values()
- DataFrame.drop()
- Series.str.split()
- Series.str.cat()
- DataFrame.dropna()
- DataFrame.fillna()
- DataFrame.style.format()
- DataFrame.iloc[]
- DataFrame.transpose()