# ------------------------ CHAPTER 7 ------------------------

**HANDLING MISSING VALUES**
- NaN (Not a Number) and NA (not available) both mark missing values
- Missing values can be filtered with boolean indexing, but pandas provides dedicated methods for the task
    - dropna 
    - fillna
    - isna
    - notna
- For DataFrames, dropna drops whole rows (the default) or whole columns (axis="columns"); by default any row containing a NaN is dropped
    - how="all" -> drops only the rows/columns in which every value is NaN
    - thresh -> the minimum number of non-NaN values a row or column must have to be kept
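
The difference between the default behaviour, how="all" and thresh is easiest to see on a small frame. The following is a minimal sketch; the variable names (`frame`, `complete_rows`, and so on) are illustrative and not part of the original notes.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.0, 6.5, 3.0],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 6.5, 3.0]]
)

# Default (how="any"): keep only the rows without a single missing value.
complete_rows = frame.dropna()

# how="all": drop only the rows in which every value is missing.
not_all_missing = frame.dropna(how="all")

# thresh=2: keep the rows that have at least 2 non-missing values.
at_least_two = frame.dropna(thresh=2)

print(list(complete_rows.index))
print(list(not_all_missing.index))
print(list(at_least_two.index))
```

Row 1 has only one non-missing value, so it survives how="all" but not thresh=2.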

In [None]:
import pandas as pd
import numpy as np

data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()
data[data.notna()]
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],[np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data.dropna()
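data.dropna(how="all" , axis="columns") #drops columns in which every value is NaN
data.dropna(thresh=2) #keeps rows with at least 2 non-NaN values; recent pandas versions do not accept how and thresh together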

data.fillna(0)
data.fillna({1:0.5 , 2:0}) #choosing what to fill in what column using dictionary
data.ffill(limit=2) #forward fill; limit is optional and caps how many consecutive NaN are filled (fillna(method="ffill") is deprecated in recent pandas)
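
isna and notna are listed above but not demonstrated, so here is a small sketch that counts missing values before and after filling. The Series `s` and the other names are made up for illustration; filling with the mean is just one possible choice, not a recommendation from the notes.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# isna() returns a boolean mask; summing it counts the missing values.
n_missing = int(s.isna().sum())

# Forward fill, but fill at most 2 consecutive missing values.
filled = s.ffill(limit=2)
n_remaining = int(filled.isna().sum())

# Fill with a statistic of the data (the mean skips NaN by default).
mean_filled = s.fillna(s.mean())

print(n_missing, n_remaining)
print(mean_filled.tolist())
```

With limit=2 the third consecutive NaN is left in place, which is why one missing value remains after the forward fill.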

**HANDLING DUPLICATES**

In [None]:
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"], "k2": [1,1, 2, 3, 3, 4, 4]})
data.duplicated() #returns a boolean Series, True for each row that repeats an earlier row
data.drop_duplicates(subset=["k1"],keep="last") #subset picks the column(s) used to detect duplicates; keep chooses which occurrence to retain ("first" by default)
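
To make the effect of subset and keep concrete, the sketch below rebuilds the same frame and compares the surviving row labels. The variable names are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                      "k2": [1, 1, 2, 3, 3, 4, 4]})

# Only the last row repeats an earlier row across all columns.
dup_mask = frame.duplicated()

# Looking at k1 alone, every value after the first "one"/"two" is a duplicate.
first_per_k1 = frame.drop_duplicates(subset=["k1"])               # keep="first"
last_per_k1 = frame.drop_duplicates(subset=["k1"], keep="last")

# keep=False drops every member of a duplicated group.
no_repeats = frame.drop_duplicates(keep=False)

print(dup_mask.tolist())
print(list(first_per_k1.index), list(last_per_k1.index))
print(len(no_repeats))
```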

**TRANSFORMING THE DATA USING MAPPING AND FUNCTION**

In [None]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon", "pastrami", "corned beef", "bacon", "pastrami", "honey ham", "nova lox"],"ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {"bacon": "pig","pulled pork": "pig","pastrami": "cow","corned beef": "cow","honey ham": "pig","nova lox": "salmon"}
data["animal"] = data["food"].map(meat_to_animal) #this will map all the animals according to the dictionary

def get_animal(x): #this is a function based approach
    return meat_to_animal[x] #a dictionary is indexed with [], it cannot be called like a function
data["food"].map(get_animal)
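
One detail worth checking is what happens to values that have no entry in the mapping. The sketch below assumes a shortened dictionary and a hypothetical helper `lookup_animal`; neither is part of the original notes.

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
food = pd.Series(["bacon", "pastrami", "tofu"])

# Mapping with a dictionary: values without an entry become NaN.
mapped = food.map(meat_to_animal)

# Mapping with a function: dict.get lets us choose a fallback instead.
def lookup_animal(name):
    return meat_to_animal.get(name, "unknown")

mapped_with_default = food.map(lookup_animal)

print(mapped.isna().tolist())
print(mapped_with_default.tolist())
```

A function that indexes the dictionary directly would instead raise a KeyError for "tofu", so the function-based approach only matches the dictionary approach when every value has an entry.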

**REPLACING VALUES**

In [None]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data.replace(-999, np.nan) #replacing value
data.replace([-999 , -1000],np.nan) #replacing multiple values
data.replace([-999 , -1000],[np.nan , 0]) #choosing different value for different element
data.replace({-999:np.nan , -1000:0}) #by using dictionary

**RENAMING AXIS INDEXES**

In [None]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)), index=["Ohio","Colorado", "New York"], columns=["one", "two", "three", "four"])
def transform(x):
    return x[:4].upper()
data.index = data.index.map(transform)  #keeps the first 4 characters of each index label in uppercase; assigning to data.index modifies the original dataframe

data.rename(index=str.title , columns=str.upper) #This does not change the original dataframe
data.rename(index={"OHIO":"INDIANA"},columns={"three":"pikaboo"}) #using dictionary

**DISCRETIZATION AND BINNING**
- Discretization divides continuous data into discrete bins, for example grouping ages into age ranges
- Inspect the output of pd.cut() at least once: for a list input it returns a Categorical whose categories are the bin intervals

In [None]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
age_categories = pd.cut(ages , bins , right = False) #categorizes the ages into the intervals defined by bins; by default the right side of each interval is closed, right=False closes the left side instead
age_categories.value_counts()

data = np.random.uniform(size=20)
pd.cut(data , 4 , precision=2) #equal length bins

data = np.random.standard_normal(1000)
quartiles = pd.qcut(data , 4 , precision=2) #quantile based bins: qcut puts roughly the same number of values in each bin
quartiles.value_counts()
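
The output of pd.cut and the contrast with pd.qcut can be inspected as follows. This is a sketch under a few assumptions: the label names and the seeded generator `rng` are invented here for reproducibility and are not part of the original notes.

```python
import numpy as np
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Default right=True gives the intervals (18, 25], (25, 35], (35, 60], (60, 100].
age_categories = pd.cut(ages, bins)
codes = list(age_categories.codes)      # bin number assigned to each age
intervals = age_categories.categories   # the IntervalIndex describing the bins

# Hypothetical bin names passed through the labels argument.
named = pd.cut(ages, bins, labels=["youth", "young_adult", "middle_aged", "senior"])
named_counts = named.value_counts()

# qcut cuts at sample quantiles, so the bins hold equal numbers of values.
rng = np.random.default_rng(0)
values = rng.standard_normal(1000)
quartile_counts = pd.qcut(values, 4).value_counts()

print(codes)
print(intervals)
print(named_counts)
print(quartile_counts)
```

pd.cut with an integer creates bins of equal width, so the counts per bin usually differ, while pd.qcut adapts the bin edges to the data so that the counts match.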

**DETECTING AND FILTERING OUTLIERS**

In [None]:
data = pd.DataFrame(np.random.standard_normal((1000 , 4)))
col = data[2]
col[col.abs() > 3]  #values in column 2 that are > 3 or < -3

data[(data.abs() > 3).any(axis="columns")] #all rows that contain at least one outlier
data[data.abs() > 3] = np.sign(data) * 3 #caps outliers at -3 or 3, keeping their sign
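
As a quick check of the capping step, the sketch below applies it to a copy of a seeded random frame and compares the result with DataFrame.clip, which is an alternative way to get the same bounds. The seed and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Cap with the sign trick: outliers become -3.0 or 3.0 depending on their sign.
capped = frame.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# Alternative: clip every value into the interval [-3, 3].
clipped = frame.clip(lower=-3, upper=3)

largest = capped.abs().max().max()
same_result = np.allclose(capped.to_numpy(), clipped.to_numpy())

print(largest, same_result)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the data before and after capping.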