
# **EDA (stepwise)**

# **1st step:** The first step in exploratory analysis is **reading** in the data and then exploring the variables.

titanic_train = pd.read_csv("../input/train.csv")      # Read the data
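
The line above assumes that pandas has already been imported as `pd` and that `train.csv` sits in `../input/`. A minimal, self-contained sketch of the same step follows. The inline CSV text is a made-up stand-in for the real file (an assumption for illustration, using only some of the Titanic column names), so the snippet runs without the data set.

```python
import io

import pandas as pd

# Hypothetical stand-in for ../input/train.csv: a few made-up rows that
# reuse some of the Titanic column names.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare\n"
    "1,0,3,Passenger A,male,22,1,0,7.25\n"
    "2,1,1,Passenger B,female,38,1,0,71.28\n"
    "3,1,3,Passenger C,female,,0,0,7.92\n"
)

# read_csv accepts a file path or any file-like object.
titanic_train = pd.read_csv(sample_csv)
print(titanic_train)
```

With the real file, the call is the same except that the path string replaces `sample_csv`.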

---
## **2nd step:** It's a good idea to start off by checking the dimensions of your data set with **df.shape** and the variable data types with **df.dtypes**.

---



titanic_train.shape  

---
titanic_train.dtypes

---

titanic_train.head(5)
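
As a rough illustration of these three checks, the sketch below runs them on a small made-up DataFrame (the rows are invented for the example and are not the Titanic data).

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

n_rows, n_cols = df.shape      # shape is an attribute: (rows, columns)
print(n_rows, n_cols)

print(df.dtypes)               # one dtype per column
print(df.head(2))              # first 2 rows; head() shows 5 by default
```

Note that `shape` and `dtypes` are attributes, so they take no parentheses, while `head()` is a method.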

---

## **3rd step:**


After getting a sense of the data's structure, it is a good idea to look at a statistical summary of the variables with **df.describe()**:

titanic_train.describe()

---
Notice that non-numeric columns are dropped from the statistical summary provided by df.describe().
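
A small sketch of this default behavior, using made-up values: the text column `Sex` does not appear in the summary, while the two numeric columns do.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
})

# With mixed column types, describe() summarizes only numeric columns by default.
summary = df.describe()
print(summary)
```

The rows of the result are the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column.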

---

We can get a summary of the categorical variables by passing only those columns to describe():


categorical = titanic_train.dtypes[titanic_train.dtypes == "object"].index
print(categorical)

titanic_train[categorical].describe()
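
One possible variation, shown on made-up data: `select_dtypes(exclude="number")` picks every non-numeric column without matching the dtype by the name `"object"`. Using it here is a substitution for illustration, not the approach in the code above; it may be more robust if text columns are stored under a different dtype.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "female"],
    "Embarked": ["S", "C", "S", "S"],
})

# Select the non-numeric columns, then summarize only those.
categorical = df.select_dtypes(exclude="number").columns
cat_summary = df[categorical].describe()
print(cat_summary)
```

For non-numeric columns, the summary reports the count, the number of unique values, the most frequent value (`top`) and how often it occurs (`freq`).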



---

# **4th step**
# **questions** :

---



After looking at the data for the first time, you should ask yourself a few questions:

**1) Do I need all of the variables?**

Probably not: dropping variables you don't need reduces complexity and can make computation on the data faster.

del titanic_train["PassengerId"]     # Remove PassengerId

As you might have noticed, removing variables is often more of an art than a science. It is easiest to start simple: don't be afraid to remove (or simply ignore) confusing, messy or otherwise troublesome variables temporarily when you're just getting started with an analysis or predictive modeling task. Data projects are iterative processes: you can start with a simple analysis or model using only a few variables and then expand later by adding more and more of the other variables you initially ignored or removed.
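
Because removed variables often come back later, a non-destructive alternative to `del` can be handy. The sketch below (made-up data; the name `trimmed` is hypothetical) uses `DataFrame.drop`, which returns a new DataFrame and leaves the original intact.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})

# drop() returns a copy without the listed columns; df itself is unchanged,
# so the removed column is still available if it is needed later.
trimmed = df.drop(columns=["PassengerId"])
print(trimmed.columns.tolist())
```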

---
---


**2) Should I transform any variables?**

When you first load a data set, some of the variables may be encoded as data types that don't fit well with what the data really is or what it means.



**Example 1)** Survived: variables that use the numbers 0 and 1 to indicate a state or the presence or absence of something are sometimes called indicator variables or dummy variables (0 indicates absence and 1 indicates presence). Survived is such a variable: 0 means the passenger died and 1 means the passenger survived.

We could instead encode Survived as a categorical variable with more descriptive categories:

**Example 2)** Pclass is an integer that indicates a passenger's class, with 1 being first class, 2 being second class and 3 being third class. Passenger class is a category, so it doesn't make a lot of sense to encode it as a numeric variable. What's more, first class would be considered "above" or "higher" than second class, but when encoded as an integer, 1 comes before 2. We can fix this by transforming Pclass into an ordered categorical variable.
 
 **code:**  
 new_Pclass = pd.Categorical(titanic_train["Pclass"],
                           ordered=True)

new_Pclass = new_Pclass.rename_categories(["Class1","Class2","Class3"])     

new_Pclass.describe()


new_survived = pd.Categorical(titanic_train["Survived"])
new_survived = new_survived.rename_categories(["Died","Survived"])              

new_survived.describe()
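
One detail worth checking: when `ordered=True` is used without an explicit category list, pandas sorts the categories in ascending order, so 1 is still the lowest level. If the goal is for first class to rank highest, one option is to list the categories from lowest to highest explicitly. The sketch below uses made-up values, and the variable names are hypothetical.

```python
import pandas as pd

pclass = pd.Series([3, 1, 3, 2, 1])   # made-up values for illustration

# Categories are listed from lowest to highest, so third class becomes the
# lowest level and first class the highest.
ordered_pclass = pd.Categorical(pclass, categories=[3, 2, 1], ordered=True)

# rename_categories maps names by position: 3 -> Class3, 2 -> Class2, 1 -> Class1.
ordered_pclass = ordered_pclass.rename_categories(["Class3", "Class2", "Class1"])

print(ordered_pclass.categories)
print(ordered_pclass.min(), ordered_pclass.max())
```

With this ordering, comparisons and `min()`/`max()` treat first class as the top level.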

---
---
**3) Are there NA values, outliers or other strange values?**

Data sets are often littered with missing data, extreme data points called outliers and other strange values. Missing values, outliers and strange values can negatively affect statistical tests and models and may even cause certain functions to fail.

In Python, you can detect missing values with the pd.isnull() function:
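
A minimal sketch of that check on made-up data: `pd.isnull()` returns a True/False mask, and summing the mask gives the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Sex": ["male", "female", None, "male"],
})

null_mask = pd.isnull(df)               # True/False DataFrame, same shape as df
missing_per_column = null_mask.sum()    # each True counts as 1
print(missing_per_column)
```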


Detecting missing values is the easy part: it is far more difficult to decide how to handle them. In cases where you have a lot of data and only a few missing values, it might make sense to simply delete records with missing values present. 

On the other hand, if you have more than a handful of missing values, removing records with missing values could cause you to get rid of a lot of data. Missing values in categorical data are not particularly troubling because you can simply treat NA as an additional category.

Missing values in numeric variables are more troublesome, since you can't just treat a missing value as a number. As it happens, the Titanic dataset has some NAs in the Age variable:


Notice that the count of Age (712) is less than the total row count of the data set (889). This indicates missing data. We can get the row indexes of the missing values with np.where():

code:

missing = np.where(titanic_train["Age"].isnull())
missing
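
A self-contained sketch of the same lookup on a made-up Series (NumPy is assumed to be imported as `np`). Called with only a condition, `np.where` returns a tuple with one array of positions per dimension, so the positions for a Series are in element 0.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
age = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="Age")

# isnull() already yields booleans, so no comparison with True is needed.
missing = np.where(age.isnull())
missing_positions = missing[0]

print(missing_positions)
print(len(missing_positions))     # number of missing ages
```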

**How to handle the missing values:**

With 177 missing values it's probably not a good idea to throw all those records away. Here are a few ways we could deal with them:

1) Replace the null values with 0s

2) Replace the null values with some central value like the mean or median

3) Impute some other value

4) Split the data set into two parts: one set where records have an Age value and another set where Age is null.



Setting missing values in numeric data to zero makes sense in some cases, but it doesn't make any sense here because a person's age can't be zero. Setting all ages to some central number like the median is a simple fix but there's no telling whether such a central number is a reasonable estimate of age without looking at the distribution of ages. For all we know each age is equally common. We can quickly get a sense of the distribution of ages by creating a histogram of the age variable with df.hist():

**Use a histogram to see how ages are distributed before choosing a fill value:**
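
A histogram of ages can be sketched as follows. The ages are made up, and the helper name `plot_age_histogram` is hypothetical. The bin counts are computed with `np.histogram` so the distribution can be inspected as text; the helper draws the actual plot with `df.hist()` and assumes matplotlib is installed.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([2, 19, 22, 24, 25, 27, 28, 29, 31, 36, 45, 80], name="Age")

# Count how many ages fall into each 10-year bin; these counts are the
# bar heights a histogram would draw.
counts, edges = np.histogram(ages.dropna(), bins=np.arange(0, 91, 10))
for left, count in zip(edges[:-1], counts):
    print(f"{int(left):>2}-{int(left) + 10:<2} {'#' * int(count)}")


def plot_age_histogram(df):
    """Draw the histogram of the Age column (assumes matplotlib is installed)."""
    return df.hist(column="Age", bins=20, figsize=(9, 6))
```

In a notebook, calling `plot_age_histogram(titanic_train)` would render the plot.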

From the histogram, we see that ages between 20 and 30 are the most common, so filling in missing values with a central number like the mean or median wouldn't be entirely unreasonable. Let's fill in the missing values with the median value of 28:
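
A minimal sketch of the fill on made-up data. Computing the median from the column, rather than typing the number by hand, is a choice made for this example.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only.
df = pd.DataFrame({"Age": [22.0, np.nan, 28.0, np.nan, 35.0]})

median_age = df["Age"].median()            # missing values are skipped by default
df["Age"] = df["Age"].fillna(median_age)   # replace each missing age with the median

print(median_age)
print(df["Age"].tolist())
```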

**Imputation as an alternative:**

In practice, imputing the missing data (estimating age based on other variables) might have been a better option, but we'll stick with the median for now.

---
---




Next, let's consider outliers. Outliers are extreme numerical values: values that lie far away from the typical values a variable takes on. Creating plots is one of the quickest ways to detect outliers. For instance, the histogram above shows that 1 or 2 passengers were near age 80. Ages near 80 are uncommon for this data set, but given the general shape of the data, seeing one or two 80-year-olds doesn't seem particularly surprising.

**Use a boxplot to understand the spread of the data and spot outliers**

This time we'll look at how much passengers paid (the Fare variable) with a boxplot, since boxplots are designed to show the spread of the data and help identify outliers.


In a boxplot, the central box represents 50% of the data and the central bar represents the median. The dotted lines with bars on the ends are "whiskers" which encompass the great majority of the data, and points beyond the whiskers indicate uncommon values. In this case, we have some uncommon values that are so far away from the typical value that the box appears squashed in the plot: this is a clear indication of outliers. Indeed, it looks like one passenger paid almost twice as much as any other passenger. Even the passengers who paid between 200 and 300 paid far more than the vast majority of the other passengers.
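
The same idea can be checked numerically. The sketch below uses made-up fares and the common 1.5 × IQR convention for the upper whisker limit; that rule is an assumption here (it is matplotlib's default), and only the upper side is checked. The helper name `plot_fare_boxplot` is hypothetical and assumes matplotlib is installed.

```python
import pandas as pd

# Made-up fares for illustration only.
fares = pd.Series(
    [7.25, 7.9, 8.05, 13.0, 15.5, 26.0, 31.0, 71.0, 263.0, 512.0],
    name="Fare",
)

q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1                            # height of the central box
upper_whisker_limit = q3 + 1.5 * iqr     # assumed convention for the whisker limit

# Values above the limit would be drawn as individual points beyond the whisker.
outliers = fares[fares > upper_whisker_limit]
print(outliers)
print(fares.idxmax())                    # row label of the largest fare


def plot_fare_boxplot(df):
    """Draw a boxplot of the Fare column (assumes matplotlib is installed)."""
    return df["Fare"].plot(kind="box", figsize=(9, 9))
```

Looking up the row with `idxmax()` is a quick way to inspect the most extreme record before deciding whether to keep it.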



Data sets can have other strange values beyond missing values and outliers that you may need to address. Sometimes data is mislabeled or simply erroneous; bad data can corrupt any sort of analysis so it is important to address these sorts of issues before doing too much work.


**4) Should I create new variables?**

The variables present when you load a data set aren't always the most useful variables for analysis. Creating new variables that are derivations or combinations of existing ones is a common step to take before jumping into an analysis or modeling task.

For example, imagine you are analyzing web site auctions where one of the data fields is a text description of the item being sold. A raw block of text is difficult to use in any sort of analysis, but you could create new variables from it such as a variable storing the length of the description or variables indicating the presence of certain keywords.

Creating a new variable can be as simple as taking one variable and adding, multiplying or dividing by another. Let's create a new variable, Family, that combines SibSp and Parch to indicate the total number of family members (siblings, spouses, parents and children) a passenger has on board:
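
A minimal sketch of that combination on made-up rows: adding two columns produces a new column, computed row by row.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],    # siblings and spouses on board
    "Parch": [0, 0, 2, 1],    # parents and children on board
})

# Element-wise addition: each row gets its own family total.
df["Family"] = df["SibSp"] + df["Parch"]
print(df)
```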