# ------------------------ CHAPTER 7 ------------------------

**HANDLING MISSING VALUES**
- NaN (not a number) and NA (not available) both mark missing data
- Missing values can be filtered with boolean indexing, but pandas has dedicated methods for this task
    - dropna
    - fillna
    - isna
    - notna
- For DataFrames, dropna by default drops a whole row (or column) if it contains any NaN
    - how="all" -> drops only the rows/columns in which every value is NaN
    - thresh -> the minimum number of non-NaN values a row or column must have to be kept; anything with fewer is dropped
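
The difference between `how="all"` and `thresh` is easiest to see on a small frame. The sketch below reuses the same four rows as the cell that follows; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.0, 6.5, 3.0],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 6.5, 3.0]]
)

# how="all" removes only rows in which every value is NaN (row 2 here).
all_nan_dropped = frame.dropna(how="all")

# thresh=2 keeps rows that have at least 2 non-NaN values (rows 0 and 3).
kept_by_thresh = frame.dropna(thresh=2)

print(all_nan_dropped.index.tolist())
print(kept_by_thresh.index.tolist())
```

Row 1 has a single non-NaN value, so it survives `how="all"` but not `thresh=2`.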

In [None]:
import pandas as pd
import numpy as np

data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()
data[data.notna()]
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],[np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data.dropna()
data.dropna(how="all" , axis="columns") #drops columns in which every value is NaN
data.dropna(axis="columns", thresh=2) #keeps only columns with at least 2 non-NaN values

data.fillna(0)
data.fillna({1:0.5 , 2:0}) #choosing what to fill in what column using dictionary
data.ffill(limit=2) #forward fill; limit is optional (fillna(method="ffill") is deprecated in recent pandas versions)

**HANDLING DUPLICATES**

In [None]:
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"], "k2": [1,1, 2, 3, 3, 4, 4]})
data.duplicated() #boolean Series: True where a row repeats an earlier row
data.drop_duplicates(subset=["k1"],keep="last") #subset can list several columns; keep chooses which occurrence to retain
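
To see what `keep` actually changes, here is a small sketch on the same data. It compares the default `keep="first"` with `keep="last"`; the variable names are assumptions made for the example.

```python
import pandas as pd

frame = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
})

# Whole-row check: only the final row ("two", 4) repeats an earlier row.
row_flags = frame.duplicated()

# Deduplicate on k1 alone, keeping the first or the last occurrence of each key.
first_kept = frame.drop_duplicates(subset=["k1"])               # keep="first" is the default
last_kept = frame.drop_duplicates(subset=["k1"], keep="last")

print(first_kept.index.tolist())
print(last_kept.index.tolist())
```

The original row labels are preserved, so they show which occurrence of each key was retained.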

**TRANSFORMING THE DATA USING MAPPING AND FUNCTION**

In [None]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon", "pastrami", "corned beef", "bacon", "pastrami", "honey ham", "nova lox"],"ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {"bacon": "pig","pulled pork": "pig","pastrami": "cow","corned beef": "cow","honey ham": "pig","nova lox": "salmon"}
data["animal"] = data["food"].map(meat_to_animal) #this will map all the animals according to the dictionary

def get_animal(x): #function-based approach
    return meat_to_animal[x] #look the food up in the dictionary (a dict is indexed, not called)
data["food"].map(get_animal)
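
One thing to watch with dictionary mapping: keys that do not match exactly come back as NaN. The sketch below uses a made-up Series with inconsistent capitalization and one food ("tofu") that is deliberately missing from the dictionary.

```python
import pandas as pd

food = pd.Series(["Bacon", "pastrami", "Nova Lox", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow", "nova lox": "salmon"}

# Direct mapping: keys are case-sensitive, so "Bacon" and "Nova Lox" are not found.
direct = food.map(meat_to_animal)

# Normalizing the case first fixes that; "tofu" is still unmapped and stays NaN.
normalized = food.str.lower().map(meat_to_animal)

print(direct.isna().tolist())
print(normalized.isna().tolist())
```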

**REPLACING VALUES**

In [None]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data.replace(-999, np.nan) #replacing value
data.replace([-999 , -1000],np.nan) #replacing multiple values
data.replace([-999 , -1000],[np.nan , 0]) #choosing different value for different element
data.replace({-999:np.nan , -1000:0}) #by using dictionary

**RENAMING AXIS INDEXES**

In [None]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)), index=["Ohio","Colorado", "New York"], columns=["one", "two", "three", "four"])
def transform(x):
    return x[:4].upper()
data.index = data.index.map(transform)  #keeps the first four characters of each label and upper-cases them; assigning to data.index modifies the original dataframe

data.rename(index=str.title , columns=str.upper) #This does not change the original dataframe
data.rename(index={"OHIO":"INDIANA"},columns={"three":"pikaboo"}) #using dictionary

**DISCRETIZATION AND BINNING**
- Splits continuous data into discrete bins, e.g. ages into age groups
- Inspect the output of pd.cut() at least once; seeing the Categorical it returns makes the idea clearer
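
A minimal sketch of what `pd.cut` hands back, using the same ages and bin edges as the cell that follows. The `labels` list is an illustrative assumption, not part of the original notes.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
labels = ["youth", "young_adult", "middle_aged", "senior"]  # hypothetical names

binned = pd.cut(ages, bins)                # a Categorical of intervals, closed on the right
named = pd.cut(ages, bins, labels=labels)  # same bins, readable labels

codes = binned.codes.tolist()              # bin number assigned to each age
counts = named.value_counts()              # how many ages fall in each bin

print(binned.categories)
print(codes)
print(counts)
```

Because the intervals are closed on the right by default, the age 25 lands in the first bin.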

In [None]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
age_categories = pd.cut(ages , bins , right = False) #sorts each age into an interval built from bins; intervals are closed on the right by default, and right=False closes them on the left instead
age_categories.value_counts()

data = np.random.uniform(size=20)
pd.cut(data , 4 , precision=2) #equal length bins

data = np.random.standard_normal(1000)
quartiles = pd.qcut(data , 4 , precision=2) #quantile-based bins (qcut): each bin holds roughly the same number of points
quartiles.value_counts()

**DETECTING AND FILTERING OUTLIERS**

In [None]:
data = pd.DataFrame(np.random.standard_normal((1000 , 4)))
col = data[2]
col[col.abs() > 3]  #values > 3 or < -3

data[(data.abs() > 3).any(axis="columns")] #all rows with outliers
data[data.abs() > 3] = np.sign(data) * 3
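
The `np.sign(data) * 3` assignment caps every value to the interval [-3, 3]. As a cross-check, the sketch below compares it with `DataFrame.clip`, which is an alternative way to get the same result (the seed and variable names are assumptions for the example).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the notes: overwrite outliers with +3 or -3 depending on their sign.
capped_sign = frame.copy()
capped_sign[capped_sign.abs() > 3] = np.sign(capped_sign) * 3

# Alternative: clip limits every value to the given bounds in one call.
capped_clip = frame.clip(lower=-3, upper=3)

print(np.allclose(capped_sign.to_numpy(), capped_clip.to_numpy()))
```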

**PERMUTATION AND SAMPLING**
- `permutation` -> reordering the rows or columns of a Series/DataFrame at random
- `sampling` -> selecting a random subset of rows or columns from a Series/DataFrame, either with or without replacement
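
A short sketch of the difference, assuming a small 5x7 frame like the one in the next cell. The seeds are arbitrary and only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))
rng = np.random.default_rng(42)

# Permutation: every row appears exactly once, only the order changes.
order = rng.permutation(len(frame))
shuffled = frame.take(order)

# Sampling without replacement: a subset of rows, no row picked twice.
without_repl = frame.sample(n=3, random_state=1)

# Sampling with replacement: rows may repeat, so n can exceed the number of rows.
with_repl = frame.sample(n=10, replace=True, random_state=1)

print(shuffled.index.tolist())
print(without_repl.index.tolist())
print(with_repl.index.tolist())
```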

In [None]:
df = pd.DataFrame(np.arange(5*7).reshape((5,7)))
sampler_for_rows = np.random.permutation(5)
sampler_for_columns = np.random.permutation(7)
df.take(sampler_for_rows) #default axis rows
df.take(sampler_for_columns , axis="columns")

df.sample(n=4 , axis="columns" ,replace=True) #default axis is rows , default replace is False

**DUMMY VARIABLES**
- Basic dummies: convert categorical data (like "a", "b", "c") into 0/1 indicator columns
- Prefix: adds a label to the new column names so their origin stays clear
- Multiple categories: when one cell holds several values (like "Action|Comedy"), use str.get_dummies
- Dummies from numbers: bin the numbers into ranges with pd.cut first, then build dummy variables from the bins
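
The first three bullets can be sketched on toy data as follows. The `genres` Series is made up for illustration, and `dtype=int` is passed because recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Basic dummies with a prefix: one indicator column per distinct key.
keys = pd.Series(["b", "b", "a", "c", "a", "b"])
basic = pd.get_dummies(keys, prefix="key", dtype=int)

# Multiple categories per cell: split on the separator, one column per genre.
genres = pd.Series(["Action|Comedy", "Comedy", "Drama|Action"])
multi = genres.str.get_dummies("|")

print(basic)
print(multi)
```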

In [None]:
df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],"data1": range(6)})
dummy_variable = pd.get_dummies(df["key"],prefix="key") #prefix is an optional argument
df_with_dummy = df.join(dummy_variable) #adding it with the original dataframe

mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/datasets/movielens/movies.dat", sep="::",header=None, names=mnames, engine="python")
dummy_variable = movies["genres"].str.get_dummies("|") #one indicator column per genre, splitting each string on the "|" separator
movies_dummy = movies.join(dummy_variable.add_prefix("genre_"))

np.random.seed(12345)
values = np.random.uniform(size=10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
dummy = pd.get_dummies(pd.cut(values , bins))

**EXTENSION DATA TYPES**
- pandas is built on NumPy, so missing values are stored as np.nan by default; this can cause compatibility issues because it changes the dtype of the stored data (e.g. integers become floats)
- <NA> is how the special missing value pd.NA is displayed
- dtype="Int64" is the string shorthand for the pd.Int64Dtype() extension type
- Importance ->
    - Accuracy: handles missing values correctly without changing the data type
    - Speed: can be faster on larger data sets
    - Memory: can use less memory, especially for strings
    - Flexibility: supports special data types like time zones etc.

| **Extension Type**    | **Description**                              |
|-----------------------|----------------------------------------------|
| `BooleanDtype`        | Nullable Boolean data, use `"boolean"` when passing as string |
| `CategoricalDtype`    | Categorical data type, use `"category"` when passing as string |
| `DatetimeTZDtype`     | Datetime with time zone                     |
| `Float32Dtype`        | 32-bit nullable floating point, use `"Float32"` when passing as string |
| `Float64Dtype`        | 64-bit nullable floating point, use `"Float64"` when passing as string |
| `Int8Dtype`           | 8-bit nullable signed integer, use `"Int8"` when passing as string |
| `Int16Dtype`          | 16-bit nullable signed integer, use `"Int16"` when passing as string |
| `Int32Dtype`          | 32-bit nullable signed integer, use `"Int32"` when passing as string |
| `Int64Dtype`          | 64-bit nullable signed integer, use `"Int64"` when passing as string |
| `UInt8Dtype`          | 8-bit nullable unsigned integer, use `"UInt8"` when passing as string |
| `UInt16Dtype`         | 16-bit nullable unsigned integer, use `"UInt16"` when passing as string |
| `UInt32Dtype`         | 32-bit nullable unsigned integer, use `"UInt32"` when passing as string |
| `UInt64Dtype`         | 64-bit nullable unsigned integer, use `"UInt64"` when passing as string |
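
As a quick illustration of why the nullable types matter, the sketch below builds the same integer data with and without an extension dtype. Variable names are illustrative only.

```python
import pandas as pd

# Default behaviour: the missing value forces the whole Series to float64.
plain = pd.Series([1, 2, 3, None])

# Nullable extension type: values stay integers and the gap is shown as <NA>.
nullable = pd.Series([1, 2, 3, None], dtype="Int64")

missing_is_na = nullable[3] is pd.NA   # the missing entry is the pd.NA singleton
total = nullable.sum()                 # missing values are skipped by default

print(plain.dtype, nullable.dtype)
print(nullable)
```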

In [None]:
s = pd.Series([1,2,3,None]) #because of the missing value pandas stores this as float64, although the data is really integer
s = pd.Series([1,2,3,None] , dtype= pd.Int64Dtype()) #the nullable Int64 dtype (capital I) keeps the values as integers
s = pd.Series(["hello" , "world" , None]) #by default pandas stores this as 'object' dtype
s = pd.Series(["hello" , "world" , None],dtype = pd.StringDtype()) #this makes it the dedicated string dtype

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": ["one", "two", "three", None],
    "C": [False, None, False, True]
})
df["A"] = df["A"].astype("Int64")
df["B"] = df["B"].astype("string")
df["C"] = df["C"].astype("boolean")

**STRING MANIPULATION**
- Python built-in string manipulation methods
-----------
| **Method**       | **Description**                                                                 |
|------------------|---------------------------------------------------------------------------------|
| `count`          | Returns the number of non-overlapping occurrences of a substring in the string |
| `endswith`       | Returns True if the string ends with the given suffix                          |
| `startswith`     | Returns True if the string starts with the given prefix                        |
| `join`           | Uses the string as a delimiter to join a sequence (list/tuple) of strings      |
| `index`          | Returns the first index of the substring; raises `ValueError` if not found     |
| `find`           | Returns the first index of the substring; returns `-1` if not found            |
| `rfind`          | Returns the index of the last occurrence of the substring; returns `-1` if not found |
| `replace`        | Replaces one substring with another                                            |
| `strip`          | Trims whitespace (spaces, newlines) from both sides                            |
| `rstrip`         | Trims whitespace from the right side                                           |
| `lstrip`         | Trims whitespace from the left side                                            |
| `split`          | Breaks the string into a list of substrings using the delimiter                |
| `lower`          | Converts all characters to lowercase                                           |
| `upper`          | Converts all characters to uppercase                                           |
| `casefold`       | Converts to lowercase and also normalizes region-specific characters to a common comparable form |
| `ljust`          | Left-justifies the string, padding the right side with spaces                  |
| `rjust`          | Right-justifies the string, padding the left side with spaces                  |
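
A few rows of the table are easier to remember when run side by side. This is a small sketch on the same sample string used in the next cell; the `casefold` input is an extra illustrative example.

```python
val = "a,b,  guide"

# split + strip: break on commas, then trim the whitespace around each piece.
pieces = [piece.strip() for piece in val.split(",")]
joined = "::".join(pieces)

# find returns -1 for a missing substring, index raises ValueError instead.
position = val.find(":")
try:
    val.index(":")
    raised = False
except ValueError:
    raised = True

# casefold goes further than lower for some non-English characters.
folded = "Straße".casefold()

print(pieces, joined, position, raised, folded)
```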

In [None]:
val = "a,b,  guide"

val.split(",") #returns a list
pieces = [x.strip() for x in val.split(",")] #split on "," and trim the whitespace around each piece
"::".join(pieces) #more idiomatic than chaining + concatenation
val.index(",") #returns the first index of the given argument
val.find(",") #works like index, but returns -1 instead of raising an error when nothing is found
val.replace(",",":") #replaces the "," with ":"
val.count(",") #counts how many times "," comes in the string

**REGULAR EXPRESSIONS**
- A regex describes a pattern in text; once the pattern is found we can perform several operations on the matches
- Its three main uses are -> pattern matching, substitution, splitting
- Pattern explanation
    - [A-Z0-9._%+-]+: username (letters, numbers, special chars).
    - @: the @ symbol.
    - [A-Z0-9.-]+: domain name.
    - \.[A-Z]{2,4}: a dot and a 2-4 letter suffix (com, net, etc.).

- To work with groups, put parentheses around each part of the pattern you want to capture
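
The grouped version of the pattern can be sketched like this. The sample address is made up, and the named-group variant (`?P<name>`) is an optional extra from the standard `re` module rather than something the notes above use.

```python
import re

# Same email pattern, with parentheses around the three parts to capture.
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match("wesm@bright.net")
parts = m.groups()            # tuple: (username, domain, suffix)

# Named groups make the captured parts self-describing.
named = re.compile(
    r"(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})",
    flags=re.IGNORECASE,
)
as_dict = named.match("wesm@bright.net").groupdict()

print(parts)
print(as_dict)
```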

| **Method**  | **Description**                                                                                   |
|-------------|---------------------------------------------------------------------------------------------------|
| `findall`   | Returns all non-overlapping matches in the string as a list                                       |
| `finditer`  | Like `findall`, but returns an iterator                                                           |
| `match`     | Matches the pattern at the start of the string; can also capture groups; returns `None` if there is no match |
| `search`    | Scans the string for the first match anywhere and returns a match object                          |
| `split`     | Breaks the string into pieces wherever the pattern occurs                                         |
| `sub`       | Replaces the pattern with a replacement string; the replacement can refer to groups (`\1`, `\2`)  |
| `subn`      | Like `sub`, but also reports how many replacements were made (as a tuple)                         |
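
`finditer` and `subn` do not appear in the cell that follows, so here is a brief sketch of both on a whitespace pattern. The variable names are illustrative.

```python
import re

text = "foo    bar\t baz  \tqux"
whitespace = re.compile(r"\s+")

# split: cut the text wherever a run of whitespace occurs.
words = whitespace.split(text)

# finditer: lazily yields match objects, here reduced to (start, end) spans.
spans = [m.span() for m in whitespace.finditer(text)]

# subn: like sub, but also returns the number of replacements made.
collapsed, n_replaced = whitespace.subn(" ", text)

print(words)
print(spans)
print(collapsed, n_replaced)
```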

In [None]:
import re
text = "foo    bar\t baz  \tqux"
re.split(r"\s+", text) #we told it the pattern and it splits text based on that
regex = re.compile(r"\s+") #made a regex object to process it faster for reuse

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com"""
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text) #finds all matching patterns and returns a list of them
m = regex.search(text) #gives the first pattern match in the text
text[m.start():m.end()] #start() and end() give the index positions of the match
regex.match(text) #None -> because it only matches at the start of the string
regex.sub("REDACTED", text) #substitutes "REDACTED" wherever the pattern is found

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text) #list of tuples with all parts of the matched patterns
m = regex.search(text) #finds first match, not None, use groups() for parts
m.groups() #this returns a tuple of the matched parts, not a single string
print(regex.match(text)) #prints None because pattern doesn't match at start
regex.sub(r"Username: \1, Domain: \2, Suffix: \3", text) #\1 \2 \3 are the pattern groups

**STRING FUNCTIONS IN PANDAS**

| Method      | Description  |
|------------|-------------|
| **cat**        | Concatenate strings element-wise with optional delimiter |
| **contains**   | Return Boolean array if each string contains pattern/regex |
| **count**      | Count occurrences of pattern |
| **extract**    | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group |
| **endswith**   | Equivalent to `x.endswith(pattern)` for each element |
| **startswith** | Equivalent to `x.startswith(pattern)` for each element |
| **findall**    | Compute list of all occurrences of pattern/regex for each string |
| **get**        | Index into each element (retrieve i-th element) |
| **isalnum**    | Equivalent to built-in `str.isalnum` |
| **isalpha**    | Equivalent to built-in `str.isalpha` |
| **isdecimal**  | Equivalent to built-in `str.isdecimal` |
| **isdigit**    | Equivalent to built-in `str.isdigit` |
| **islower**    | Equivalent to built-in `str.islower` |
| **isnumeric**  | Equivalent to built-in `str.isnumeric` |
| **isupper**    | Equivalent to built-in `str.isupper` |
| **join**       | Join strings in each element of the Series with passed separator |
| **len**        | Compute length of each string |
| **lower, upper** | Convert cases; equivalent to `x.lower()` or `x.upper()` for each element |
| **match**      | Use `re.match` with the passed regular expression on each element, returning `True` or `False` whether it matches |
| **pad**        | Add whitespace to left, right, or both sides of strings |
| **center**     | Equivalent to `pad(side="both")` |
| **repeat**     | Duplicate values (e.g., `s.str.repeat(3)` is equivalent to `x * 3` for each string) |
| **replace**    | Replace occurrences of pattern/regex with some other string |
| **slice**      | Slice each string in the Series |
| **split**      | Split strings on delimiter or regular expression |
| **strip**      | Trim whitespace from both sides, including newlines |
| **rstrip**     | Trim whitespace on right side |
| **lstrip**     | Trim whitespace on left side |
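
A handful of these accessor methods in action, as a rough sketch on the same email Series used in the next cell. Passing `na=False` to `contains` is a choice made here so the result is a clean boolean mask.

```python
import numpy as np
import pandas as pd

emails = pd.Series(
    {"Dave": "dave@google.com", "Steve": "steve@gmail.com",
     "Rob": "rob@gmail.com", "Wes": np.nan}
)

is_gmail = emails.str.contains("gmail", na=False)   # missing values count as False
domains = emails.str.split("@").str.get(1)           # part after the @ sign
prefix = emails.str.slice(0, 4)                      # first four characters
lengths = emails.str.len()                           # missing values stay missing

print(is_gmail)
print(domains)
print(lengths)
```

All of these skip over the missing entry instead of raising, which is the main advantage over mapping plain Python string methods.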


In [None]:
data = pd.Series({"Dave": "dave@google.com","Steve": "steve@gmail.com","Rob": "rob@gmail.com","Wes": np.nan})
data.str.contains("gmail") #checks whether each string contains "gmail"; missing values stay missing
data_string = data.astype("string") #converts from object dtype to string dtype

import re
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
data.str.findall(pattern , flags=re.IGNORECASE)

match = data.str.findall(pattern , flags=re.IGNORECASE).str[0] #to get the first match of the pattern(a tuple of parts)
match.str.get(1)  # to get the part at index 1 in the tuple

data.str.extract(pattern , flags=re.IGNORECASE) #returns a DataFrame with one column per group

**CATEGORICAL DATA - BACKGROUND AND MOTIVATION**
- When a column contains many repeated strings, we can store small integer codes in their place; the codes take less space than repeating the strings themselves
    - this is called the categorical or dictionary-encoded representation
- Categorical extension type in pandas
-------
This works with any immutable objects, not just strings
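
To get a feel for the space saving, the sketch below compares the memory footprint of a repeated-string column before and after conversion. The column size is an arbitrary assumption, and the exact byte counts depend on the pandas version and string storage.

```python
import pandas as pd

# 100,000 values but only two distinct strings.
values = pd.Series(["apple", "orange", "apple", "apple"] * 25_000)
as_category = values.astype("category")

object_bytes = values.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# The categorical stores each distinct string once, plus one small code per row.
categories = list(as_category.cat.categories)
codes_head = as_category.cat.codes[:4].tolist()

print(object_bytes, category_bytes)
print(categories, codes_head)
```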


In [None]:
values = pd.Series(["apple","orange","apple","apple"]*2) #column with repeated strings
values.unique() #["apple","orange"]

dim = pd.Series(values.unique())
values = pd.Series([1,0,1,1]*2) #integer codes standing in for each category
dim.take(values) #take rebuilds the original strings from the codes; storing codes plus a small lookup table saves memory

In [None]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
rng = np.random.default_rng(seed=12345)
df = pd.DataFrame({
    'fruit': fruits,
    'basket_id': np.arange(len(fruits)),
    'count': rng.integers(3, 15, size=len(fruits)),
    'weight': rng.uniform(0, 4, size=len(fruits))
})

fruits_cat = df["fruit"].astype("category") #converts a copy of the column to
#categorical data, which usually takes less memory; it is no longer a plain string column
c = fruits_cat.array #the underlying array, to check that it is the categorical type
type(c) #pandas.core.arrays.categorical.Categorical

c.categories
c.codes
dict(enumerate(c.categories)) #maps each code to its category for easier reading

In [None]:
"""JUST LIKE ABOVE THERE ARE MORE WAYS TO MAKE CATEGORICAL DATA TYPES"""
#WITH SEQUENCE
My_categorical = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

#WITH CODES
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cat = pd.Categorical.from_codes(codes , categories)

#ORDERED -> like 'foo' < 'bar' < 'baz'
my_cat.as_ordered() #method 1
my_cat = pd.Categorical.from_codes(codes , categories , ordered=True) #method 2
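
What ordering buys you is comparisons, `min`/`max`, and sorting by category order rather than alphabetically. The size labels below are hypothetical and chosen only because their order is intuitive.

```python
import pandas as pd

sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],  # order is defined by this list
    ordered=True,
)

smallest, largest = sizes.min(), sizes.max()
at_least_medium = (sizes >= "medium").tolist()          # element-wise comparison
sorted_sizes = pd.Series(sizes).sort_values().tolist()  # sorted by category order

print(smallest, largest)
print(at_least_medium)
print(sorted_sizes)
```

With an unordered categorical the `>=` comparison and `min`/`max` would raise a `TypeError` instead.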