## Data Understanding

Data understanding is one of the key phases of CRISP-DM, and it carries over into CRISP-ML(Q), which extends CRISP-DM to machine learning projects. It involves getting familiar with the available data, assessing its quality, and gaining insights into its structure and content. The goal of this phase is to establish a solid foundation for the subsequent data mining tasks.

Here are the main steps involved in the data understanding phase of CRISP-DM:

1. Initial Data Collection: In this step, you gather all the relevant data sources that are available for your analysis. This may include databases, files, spreadsheets, or any other data repositories.

2. Describe Data: You examine the collected data to gain an understanding of its general properties. This involves summarizing the data's format, size, and the number of records or instances. It also includes identifying the data types (numeric, categorical, etc.) and the presence of missing values or outliers.

3. Explore Data: Here, you perform a more in-depth exploration of the data to uncover patterns, relationships, or interesting characteristics. This can be done using statistical analysis, visualization techniques, or data profiling tools. The objective is to identify potential areas of interest and generate hypotheses for further analysis.

4. Verify Data Quality: Data quality is crucial for accurate analysis. In this step, you assess the quality of the data by examining its completeness, consistency, accuracy, and relevance. This may involve checking for duplicate records, data integrity issues, or inconsistencies across different sources.

5. Data Familiarization: It's important to understand the meaning and context of the data attributes. This step involves studying the metadata, data dictionaries, or any available documentation to gain insights into the semantics of the data. Domain knowledge or consultation with subject matter experts can also be useful in interpreting the data.

6. Identify Data Relationships: During this phase, you investigate the relationships between different data attributes or variables. This can be achieved through correlation analysis, data profiling, or exploratory data analysis techniques. Understanding these relationships can help you formulate appropriate hypotheses for modeling.

7. Data Sampling: Depending on the size and complexity of the data, you may choose to create a representative sample to work with. Sampling can help reduce processing time and resource requirements while still providing reliable insights.
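The "Describe Data" and "Verify Data Quality" steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the records, the field names, and the helpers `describe` and `count_duplicates` are hypothetical and exist only for this example. In practice a data-frame library would usually do this work.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"customer_id": 1, "age": 34, "country": "DE"},
    {"customer_id": 2, "age": None, "country": "FR"},
    {"customer_id": 3, "age": 51, "country": "DE"},
    {"customer_id": 3, "age": 51, "country": "DE"},  # exact duplicate
]


def describe(rows):
    """Summarize record count, observed value types, and missing values per field."""
    fields = sorted({key for row in rows for key in row})
    summary = {"n_records": len(rows), "fields": {}}
    for field in fields:
        values = [row.get(field) for row in rows]
        present = [v for v in values if v is not None]
        summary["fields"][field] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
        }
    return summary


def count_duplicates(rows):
    """Count rows that repeat an earlier row exactly."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


summary = describe(records)
n_duplicates = count_duplicates(records)
print(summary)
print("duplicate rows:", n_duplicates)
```

The summary covers the description step (size, types, missing values), while the duplicate count is one simple quality check among the several a real project would need.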

By completing these steps, you develop a comprehensive understanding of the data you're working with. This knowledge serves as a solid foundation for subsequent phases of the CRISP-DM methodology, such as data preparation, modeling, and evaluation.

### A Real-Life Scenario

Let's consider a real-life example of data understanding in the context of customer relationship management (CRM) for an e-commerce company:

1. Initial Data Collection: The company collects various types of customer data, such as purchase history, demographics, website browsing behavior, and customer support interactions. These data sources include transactional databases, web analytics tools, and customer service logs.

2. Describe Data: The collected data includes information such as customer ID, purchase amount, product categories, customer age, gender, website clickstream data, and customer feedback. The data consists of millions of records and spans multiple years.

3. Explore Data: Through data exploration techniques, the company discovers that certain product categories are frequently purchased together, and there is a higher propensity for repeat purchases among customers who engage with personalized recommendations on the website.

4. Verify Data Quality: The company examines the data for any data quality issues. They identify missing values in the customer age field and handle them by imputing values based on statistical methods. They also find some duplicate customer records and remove them from the dataset.

5. Data Familiarization: The company reviews the metadata and data dictionaries to understand the meaning of each attribute. They find that some attributes require further clarification, so they consult with the customer support team to gain a better understanding of the data semantics.

6. Identify Data Relationships: By performing correlation analysis, the company discovers that customer age is positively correlated with the average purchase amount, indicating that older customers tend to spend more. They also observe a negative correlation between website page load time and conversion rates, highlighting the importance of optimizing website performance.

7. Data Sampling: Given the large dataset, the company decides to create a representative sample of customers for a specific analysis. They randomly select 10,000 customer records to work with, then check that the sample retains the key characteristics and patterns present in the full dataset, since random selection alone does not guarantee this.
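Steps 4, 6, and 7 of the scenario might look roughly like the sketch below. The five customer records are invented for illustration and are not the company's data, median imputation is just one of many possible statistical methods, and the helper `pearson` is a hypothetical name introduced here.

```python
import math
import random
import statistics

# Hypothetical customer records (not real data); None marks a missing age.
customers = [
    {"age": 22, "avg_purchase": 30.0},
    {"age": 35, "avg_purchase": 55.0},
    {"age": None, "avg_purchase": 48.0},
    {"age": 48, "avg_purchase": 70.0},
    {"age": 61, "avg_purchase": 95.0},
]

# Step 4: impute missing ages with the median of the observed ages.
observed_ages = [c["age"] for c in customers if c["age"] is not None]
median_age = statistics.median(observed_ages)
for c in customers:
    if c["age"] is None:
        c["age"] = median_age


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Step 6: correlation between age and average purchase amount.
ages = [c["age"] for c in customers]
purchases = [c["avg_purchase"] for c in customers]
r = pearson(ages, purchases)

# Step 7: simple random sample with a fixed seed for reproducibility.
rng = random.Random(42)
sample = rng.sample(customers, k=3)

# Compare a key statistic between the sample and the full dataset.
full_mean = statistics.fmean(purchases)
sample_mean = statistics.fmean(c["avg_purchase"] for c in sample)
print(f"r = {r:.2f}, full mean = {full_mean:.1f}, sample mean = {sample_mean:.1f}")
```

Comparing summary statistics between the sample and the full dataset, as in the last lines, is one way to check whether a sample is representative enough for the analysis at hand.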

By going through these steps, the company gains a deep understanding of their customer data, identifies valuable insights, and addresses data quality issues. This understanding enables them to proceed with data preparation, modeling, and ultimately, leveraging the insights for effective customer relationship management strategies, such as personalized marketing campaigns or targeted recommendations.

### Continuous Data vs Discrete Data

Continuous data refers to information that can take on any value within a specific range. It is measured rather than counted, and there are no distinct boundaries between neighboring values. Examples of continuous data include temperature, weight, height, and time. In principle these values can be subdivided indefinitely, and they are typically represented using decimal numbers.

On the other hand, discrete data consists of separate, distinct values or categories. It often represents counts or whole numbers and cannot be divided into smaller parts. Examples of discrete data include the number of students in a class, the number of cars in a parking lot, and the answers to a survey question with options like "yes" or "no." Discrete data is typically represented using integers or categorical variables.
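As a rough illustration of the distinction, the sketch below guesses whether a column is discrete or continuous from the Python types of its values. The function `guess_kind` and the column names are hypothetical, and the rule is only a heuristic: a weight recorded in whole kilograms would be labeled discrete even though the underlying quantity is continuous.

```python
from numbers import Integral, Real


def guess_kind(values):
    """Guess 'discrete' or 'continuous' from the stored value types (heuristic)."""
    # Booleans, strings, and whole numbers are treated as discrete.
    if all(isinstance(v, (bool, str, Integral)) for v in values):
        return "discrete"
    # Any remaining all-numeric column is treated as continuous.
    if all(isinstance(v, Real) for v in values):
        return "continuous"
    return "mixed"


# Hypothetical columns based on the examples in the text.
columns = {
    "temperature_c": [21.4, 19.8, 23.1],
    "students_in_class": [28, 31, 25],
    "survey_answer": ["yes", "no", "yes"],
}

kinds = {name: guess_kind(values) for name, values in columns.items()}
print(kinds)
```

A type-based guess like this is a starting point for the "Describe Data" step; the final call still depends on what the attribute means in the domain.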

### Types of Continuous Data

1. Interval Data: Interval data represents measurements where the intervals between values are consistent and meaningful, but there is no true zero point. The zero in interval data does not indicate the absence of the quantity being measured. Examples of interval data include temperature measured in Celsius or Fahrenheit, where zero does not indicate the absence of heat.

2. Ratio Data: Ratio data, similar to interval data, has consistent and meaningful intervals between values. However, it also possesses a true zero point, indicating the absence of the measured quantity. Ratio data allows for meaningful ratios and comparisons. Examples include height, weight, time duration, and income. In ratio data, zero represents a complete absence of the quantity being measured.

3. Time Data: Time data represents measurements related to time, such as hours, minutes, seconds, or timestamps. Time data is typically considered continuous because it can be divided into smaller intervals, and any point in time can be expressed to a greater level of precision. Timestamps behave like interval data, because their zero point is a convention, whereas durations behave like ratio data.

4. Measurement Data: Measurement data includes any data that can be quantitatively measured, such as length, width, volume, or area. These measurements are typically continuous and can take on any value within a given range.

5. Sensor Data: Sensor data refers to continuous measurements collected from various sensors, devices, or instruments. This can include data from temperature sensors, pressure sensors, humidity sensors, or any other sensor that provides continuous numerical readings.

It's important to note that continuous data can be represented as decimal numbers and can theoretically take on infinite possible values within a given range. Understanding the type of continuous data you are working with is crucial for selecting appropriate statistical analysis techniques and interpreting the results accurately.
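The practical difference between interval and ratio data shows up when you divide two values. The sketch below uses temperature: Celsius is an interval scale, while Kelvin has a true zero and is therefore a ratio scale. The helper name `celsius_to_kelvin` and the two sample temperatures are chosen for illustration only.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin, which has a true zero point."""
    return celsius + 273.15


t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale and agree across both scales.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are only meaningful on a ratio scale.
naive_ratio = t2_c / t1_c  # 2.0, but 20 °C is not "twice as hot" as 10 °C
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)  # about 1.035

print(f"difference: {diff_c:.1f} °C vs {diff_k:.1f} K")
print(f"naive Celsius ratio: {naive_ratio:.3f}, Kelvin ratio: {kelvin_ratio:.3f}")
```

This is why statistics that rely on ratios, such as the coefficient of variation or percentage change, should be applied to ratio data only.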

### Types of Discrete Data

Discrete data refers to information that can only take specific, separate values. These values are often counted or categorized rather than measured on a continuous scale. Here are some common types of discrete data:

1. Binary Data: Binary data has only two possible outcomes or categories. It represents a yes/no, true/false, or presence/absence situation. Examples include answers to yes/no questions (e.g., Did you pass the exam?), or gender (male/female).

2. Count Data: Count data represents the number of occurrences or frequency of events within a given category. It involves whole numbers that cannot be negative or fractional. Examples include the number of cars in a parking lot, the number of students in a class, or the number of books on a shelf.

3. Categorical Data: Categorical data consists of distinct groups or classes. Unlike binary data, it may involve more than two categories or levels, and the categories may be unordered (nominal) or ordered (ordinal). Examples include types of pets (dog, cat, bird), marital status (single, married, divorced), or eye color (blue, brown, green).

It's important to note that discrete data is different from continuous data, which can take any value within a given range and is typically measured rather than counted or categorized.
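Each kind of discrete data calls for a slightly different summary. The sketch below shows one common choice for each, using made-up values that follow the examples above: a proportion for binary data, a total for count data, and a frequency table for categorical data.

```python
from collections import Counter

# Hypothetical values, one list per discrete data type.
passed_exam = [True, False, True, True, False]  # binary
cars_per_hour = [12, 7, 0, 15, 7]               # count
pets = ["dog", "cat", "dog", "bird", "dog"]     # categorical

# Binary data: the proportion of "yes" outcomes (True counts as 1).
pass_rate = sum(passed_exam) / len(passed_exam)

# Count data: values must be non-negative whole numbers; totals are meaningful.
counts_are_valid = all(isinstance(n, int) and n >= 0 for n in cars_per_hour)
total_cars = sum(cars_per_hour)

# Categorical data: a frequency table and the most frequent category (the mode).
pet_counts = Counter(pets)
most_common_pet, frequency = pet_counts.most_common(1)[0]

print(f"pass rate: {pass_rate:.0%}")
print(f"total cars: {total_cars} (valid counts: {counts_are_valid})")
print(f"pet frequencies: {dict(pet_counts)}, mode: {most_common_pet}")
```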

### Types of Categorical Data

Categorical data, also known as qualitative data, refers to information that can be grouped into specific categories or classes. It represents characteristics or qualities that are non-numeric in nature. Categorical data is often collected through observations, surveys, or interviews, and it helps in understanding and describing different attributes or characteristics of a population or sample.

Categorical data can be further divided into two types:

1. Nominal Data: Nominal data consists of categories or labels with no inherent order or ranking. Each category is distinct and does not have any numerical significance. Examples of nominal data include gender (male/female), types of cars (sedan, SUV, hatchback), or cities (New York, London, Tokyo).

2. Ordinal Data: Ordinal data includes categories that have a natural order or ranking. The categories can be ranked based on some criteria, but the differences between the categories may not be quantitatively measurable or equal. Examples of ordinal data include rating scales (e.g., 1 star, 2 stars, 3 stars), educational levels (e.g., elementary, middle school, high school), or survey responses (e.g., strongly agree, agree, neutral, disagree, strongly disagree).
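One consequence of this distinction is which summaries make sense. The sketch below, with invented responses, computes a mode for nominal data and uses an explicit rank mapping to sort ordinal data and find its median. The variable names and the rank mapping are assumptions made for this example.

```python
import statistics

# Nominal data: no order, so the mode is the natural summary.
cities = ["London", "Tokyo", "London", "New York"]
most_common_city = statistics.mode(cities)

# Ordinal data: the order must be stated explicitly, because alphabetical
# sorting would not match the real ranking of the levels.
education_order = ["elementary", "middle school", "high school"]
rank = {level: position for position, level in enumerate(education_order)}

responses = ["high school", "elementary", "middle school", "high school", "elementary"]
sorted_responses = sorted(responses, key=rank.__getitem__)

# The median is meaningful for ordinal data; median_low returns an actual
# observed rank, which can be mapped back to its label.
median_rank = statistics.median_low(rank[r] for r in responses)
median_level = education_order[median_rank]

print("mode of cities:", most_common_city)
print("sorted levels:", sorted_responses)
print("median level:", median_level)
```

A mean is deliberately not computed here: the gaps between ordinal levels are not guaranteed to be equal, so averaging the ranks can be misleading.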

Categorical data is often analyzed using frequency distributions, bar charts, pie charts, or contingency tables. These methods help in visualizing and summarizing the distribution of data across different categories, making it easier to draw conclusions or identify patterns within the dataset. Statistical techniques such as chi-square tests or logistic regression can also be used to analyze categorical data and explore relationships between variables.
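As a small worked example of a contingency table and a chi-square test of independence, the sketch below builds the table from hypothetical survey rows and computes the Pearson chi-square statistic by hand, without a continuity correction. The segment names and counts are invented, and a statistics library would normally supply the p-value as well.

```python
from collections import Counter

# Hypothetical survey rows: (customer segment, answer).
observations = (
    [("new", "yes")] * 30
    + [("new", "no")] * 20
    + [("returning", "yes")] * 20
    + [("returning", "no")] * 30
)

# Contingency table: the count for each (segment, answer) pair.
contingency = Counter(observations)
segments = sorted({segment for segment, _ in observations})
answers = sorted({answer for _, answer in observations})

row_totals = {s: sum(contingency[(s, a)] for a in answers) for s in segments}
col_totals = {a: sum(contingency[(s, a)] for s in segments) for a in answers}
n = len(observations)

# Pearson chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n under independence.
chi2 = 0.0
for s in segments:
    for a in answers:
        expected = row_totals[s] * col_totals[a] / n
        chi2 += (contingency[(s, a)] - expected) ** 2 / expected

degrees_of_freedom = (len(segments) - 1) * (len(answers) - 1)
print(f"chi-square = {chi2:.2f} with {degrees_of_freedom} degree(s) of freedom")
```

With these counts every expected cell is 25, so the statistic is 4.0. That exceeds 3.841, the 5% critical value for one degree of freedom, which would suggest that the answer depends on the customer segment in this made-up dataset.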