## 1. Changing types

In [3]:
import pandas as pd
titanic = pd.read_csv("data/titanic.csv")
houses = pd.read_csv("data/kc_house_data.csv")
netflix = pd.read_csv("data/netflix_titles.csv",sep = "|", index_col = 0)
btc = pd.read_csv("data/coin_Bitcoin.csv")
countries = pd.read_csv("data/world-happiness-report-2021.csv").set_index("Country name")

In [3]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB


In [16]:
# Assign the result back instead of calling inplace=True on an attribute access
titanic["age"] = titanic["age"].replace(["?"], [None])

In [7]:
titanic["age"].value_counts()

age
24        47
22        43
21        41
30        40
18        39
          ..
0.3333     1
22.5       1
70.5       1
0.6667     1
26.5       1
Name: count, Length: 98, dtype: int64

**astype()**

In [8]:
titanic.age.astype("float")

0       29.0000
1        0.9167
2        2.0000
3       30.0000
4       25.0000
         ...   
1304    14.5000
1305        NaN
1306    26.5000
1307    27.0000
1308    29.0000
Name: age, Length: 1309, dtype: float64

In [9]:
titanic.age

0           29
1       0.9167
2            2
3           30
4           25
         ...  
1304      14.5
1305      None
1306      26.5
1307        27
1308        29
Name: age, Length: 1309, dtype: object

In [11]:
titanic.age = titanic.age.astype("float")
titanic.age

0       29.0000
1        0.9167
2        2.0000
3       30.0000
4       25.0000
         ...   
1304    14.5000
1305        NaN
1306    26.5000
1307    27.0000
1308    29.0000
Name: age, Length: 1309, dtype: float64

`astype()` returns a new Series, so assigning the result back to `titanic.age` is what makes the conversion permanent.

In [12]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1309 non-null   object 
 9   cabin      1309 non-null   object 
 10  embarked   1309 non-null   object 
 11  boat       1309 non-null   object 
 12  body       1309 non-null   object 
 13  home.dest  1309 non-null   object 
dtypes: float64(1), int64(4), object(9)
memory usage: 143.3+ KB


In [13]:
titanic.age.mean()

29.8811345124283
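The steps above can be sketched on a small made-up Series instead of the Titanic file. The values and the name `ages` are assumptions for illustration only.

```python
import pandas as pd

# Made-up ages stored as strings, with "?" marking a missing value
ages = pd.Series(["29", "0.9167", "?", "30"], name="age")

# Replace the placeholder with None, then convert the column to float
cleaned = ages.replace(["?"], [None])
as_float = cleaned.astype("float")

print(as_float.dtype)   # numeric dtype after the conversion
print(as_float.mean())  # mean() skips the missing value
```

Once the column is numeric, aggregations such as `mean()` work and ignore the missing entry.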

**categories**

Converting a column that has only a few distinct values to the `category` dtype reduces its memory usage.
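A rough way to see the memory saving, using a made-up column rather than the Titanic data (the Series `sex` below is an assumption):

```python
import pandas as pd

# Made-up column: two distinct values repeated many times
sex = pd.Series(["male", "female"] * 5000, name="sex")
as_category = sex.astype("category")

# deep=True also counts the memory used by the strings themselves
text_bytes = sex.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(text_bytes, category_bytes)
print(as_category.cat.categories)
```

The category version stores each distinct value once plus a small integer code per row, which is why it tends to be smaller for repetitive columns.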

## 2. `to_numeric()`

https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html#pandas-to-numeric
The default return dtype is float64 or int64. With `errors = "coerce"`, `to_numeric()` accepts data that still contains "?" and turns those entries into NaN, so we don't need to replace them first.

In [18]:
titanic.age

0       29.0000
1        0.9167
2        2.0000
3       30.0000
4       25.0000
         ...   
1304    14.5000
1305        NaN
1306    26.5000
1307    27.0000
1308    29.0000
Name: age, Length: 1309, dtype: float64

`errors`: {'ignore', 'raise', 'coerce'}, default 'raise'

+ If 'raise', then invalid parsing will raise an exception.

+ If 'coerce', then invalid parsing will be set as NaN.

+ If 'ignore', then invalid parsing will return the input.

In [19]:
pd.to_numeric(titanic.age, errors = "coerce")

0       29.0000
1        0.9167
2        2.0000
3       30.0000
4       25.0000
         ...   
1304    14.5000
1305        NaN
1306    26.5000
1307    27.0000
1308    29.0000
Name: age, Length: 1309, dtype: float64
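Since `titanic.age` was already converted above, the effect of `errors` is easier to see on raw data that still contains "?". A minimal sketch with a made-up Series (`raw_age` is an assumption):

```python
import pandas as pd

# Made-up raw column that still contains the "?" placeholder
raw_age = pd.Series(["29", "0.9167", "?", "30"], name="age")

# errors="coerce": values that cannot be parsed become NaN
coerced = pd.to_numeric(raw_age, errors="coerce")
print(coerced)

# errors="raise" is the default: the same data raises a ValueError
try:
    pd.to_numeric(raw_age)
    raised = False
except ValueError as exc:
    raised = True
    print("to_numeric failed:", exc)
```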

## 3. `dropna()`

In [4]:
game = pd.read_csv("data/game_stats.csv")

In [5]:
game

       name      league  points  assists  rebounds
0       bob         nba    22.0      5.0      10.0
1    jessie         NaN    10.0      NaN       2.0
2       stu  euroleague     NaN      NaN       NaN
3   jackson         aba     9.0      NaN       2.0
4  timothee         NaN     8.0      NaN       NaN
5     steph         nba    49.0      8.0      10.0
6       NaN         NaN     NaN      NaN       NaN


**isna()**

In [6]:
game.isna()

    name  league  points  assists  rebounds
0  False   False   False    False     False
1  False    True   False     True     False
2  False   False    True     True      True
3  False   False   False     True     False
4  False    True   False     True      True
5  False   False   False    False     False
6   True    True    True     True      True
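Because `isna()` returns booleans, the mask can be summed to count missing values. A small sketch on a made-up frame (`game_demo` and its values are assumptions, not the contents of `game_stats.csv`):

```python
import pandas as pd

# Made-up frame with a few missing values
game_demo = pd.DataFrame(
    {
        "name": ["bob", "jessie", None],
        "points": [22.0, 10.0, None],
        "assists": [5.0, None, None],
    }
)

# True counts as 1, so sum() gives the number of missing values per column
missing_per_column = game_demo.isna().sum()
print(missing_per_column)

# all(axis=1) flags the rows where every column is missing
rows_all_missing = game_demo.isna().all(axis=1)
print(rows_all_missing)
```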


**dropna()**
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html#pandas.DataFrame.dropna

In [7]:
game.assists.dropna()

0    5.0
5    8.0
Name: assists, dtype: float64

`how`: {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

+ 'any' : If any NA values are present, drop that row or column.

+ 'all' : If all values are NA, drop that row or column.

In [9]:
game.dropna(how = "all")

       name      league  points  assists  rebounds
0       bob         nba    22.0      5.0      10.0
1    jessie         NaN    10.0      NaN       2.0
2       stu  euroleague     NaN      NaN       NaN
3   jackson         aba     9.0      NaN       2.0
4  timothee         NaN     8.0      NaN       NaN
5     steph         nba    49.0      8.0      10.0


`subset`: column label or sequence of labels, optional
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

In [12]:
game.dropna(subset = ["rebounds", "league"])

      name  league  points  assists  rebounds
0      bob     nba    22.0      5.0      10.0
3  jackson     aba     9.0      NaN       2.0
5    steph     nba    49.0      8.0      10.0


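A row is dropped when `rebounds` or `league` is missing, because the default `how = "any"` still applies within the subset. Missing values in other columns, such as `assists`, are ignored.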
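How `how` and `subset` combine can be sketched on a made-up frame (`game_demo` is an assumption, smaller than the real `game` data):

```python
import pandas as pd

# Made-up frame: row 3 is entirely missing
game_demo = pd.DataFrame(
    {
        "name": ["bob", "jessie", "stu", None],
        "league": ["nba", None, "euroleague", None],
        "rebounds": [10.0, 2.0, None, None],
    }
)

# Default how="any": drop a row if any column is missing
any_missing = game_demo.dropna()

# how="all": drop a row only if every column is missing
all_missing = game_demo.dropna(how="all")

# subset limits the check to the listed columns
subset_any = game_demo.dropna(subset=["rebounds", "league"])
subset_all = game_demo.dropna(subset=["rebounds", "league"], how="all")

print(subset_any)
print(subset_all)
```

`dropna()` returns a new DataFrame each time, so `game_demo` itself is left unchanged.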

## 4. `fillna()`

In [13]:
game

       name      league  points  assists  rebounds
0       bob         nba    22.0      5.0      10.0
1    jessie         NaN    10.0      NaN       2.0
2       stu  euroleague     NaN      NaN       NaN
3   jackson         aba     9.0      NaN       2.0
4  timothee         NaN     8.0      NaN       NaN
5     steph         nba    49.0      8.0      10.0
6       NaN         NaN     NaN      NaN       NaN


In [14]:
game.fillna(0)

       name      league  points  assists  rebounds
0       bob         nba    22.0      5.0      10.0
1    jessie           0    10.0      0.0       2.0
2       stu  euroleague     0.0      0.0       0.0
3   jackson         aba     9.0      0.0       2.0
4  timothee           0     8.0      0.0       0.0
5     steph         nba    49.0      8.0      10.0
6         0           0     0.0      0.0       0.0


In [15]:
game.fillna({"points": 0, "assists":"无"})

       name      league  points  assists  rebounds
0       bob         nba    22.0      5.0      10.0
1    jessie         NaN    10.0        无       2.0
2       stu  euroleague     0.0        无       NaN
3   jackson         aba     9.0        无       2.0
4  timothee         NaN     8.0        无       NaN
5     steph         nba    49.0      8.0      10.0
6       NaN         NaN     0.0        无       NaN
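The scalar and dict forms of `fillna()` can be compared on a made-up frame. The name `stats`, its values, and the fill value "unknown" are assumptions for illustration:

```python
import pandas as pd

# Made-up frame with missing text and numeric values
stats = pd.DataFrame(
    {
        "league": ["nba", None, "aba"],
        "points": [22.0, None, 9.0],
        "assists": [5.0, None, None],
    }
)

# A scalar fills every missing value; here only the numeric columns are selected
numeric_filled = stats[["points", "assists"]].fillna(0)

# A dict fills per column; columns not listed (assists) are left untouched
by_column = stats.fillna({"league": "unknown", "points": 0})

print(numeric_filled)
print(by_column)
```

Selecting the numeric columns first avoids putting a numeric 0 into text columns, which is what `game.fillna(0)` does to `name` and `league` above.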


In [16]:
pd.read_csv("data/sales.csv")

   rating  shipping_zip  billing_zip
0     5.0           NaN      81220.0
1     4.5       94931.0      94931.0
2     NaN       92625.0      92625.0
3     4.5       10003.0      10003.0
4     4.0           NaN      92660.0
5     NaN           NaN          NaN
6     NaN       60007.0      60007.0
