Q1. What is data encoding? How is it useful in data science?

    In general, encoding is the process of converting data, such as a sequence of characters or symbols, into a specified format so that it can be stored, transmitted, or processed reliably.

    Data encoding refers to the process of transforming categorical or text data into a numerical representation that can be processed by machine learning algorithms. This is necessary because many machine learning algorithms can only process numerical data, and cannot work with categorical data directly. There are several methods for data encoding, including label encoding, one-hot encoding, and target encoding, among others. 

    Data encoding is useful in data science because it allows machine learning algorithms to work with categorical and text data, which are common in many real-world applications. A well-chosen encoding preserves the distinctions between categories (and their order, where one exists), which can improve the accuracy of machine learning models. Encoding can also affect memory use: integer codes are usually more compact than the original strings, although one-hot encoding adds a column per category and can make the dataset wider.
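
    Of the methods named above, target encoding is the least self-explanatory: each category is replaced by the mean of the target variable over the rows in that category. The sketch below is a minimal illustration in plain Python; the city names and churn labels are made-up data, not taken from any real dataset.

```python
from collections import defaultdict

# Hypothetical data: a categorical feature and a binary target.
cities = ["Paris", "Rome", "Paris", "Oslo", "Rome", "Paris"]
churned = [1, 0, 1, 0, 1, 0]

# Accumulate the target sum and the row count per category.
totals = defaultdict(float)
counts = defaultdict(int)
for city, label in zip(cities, churned):
    totals[city] += label
    counts[city] += 1

# Target encoding: each category maps to its mean target value.
target_means = {city: totals[city] / counts[city] for city in counts}

# Replace every category by its encoded value.
encoded = [target_means[city] for city in cities]
print(target_means)
print(encoded)
```

    In practice the means are usually computed on the training data only, so that information from the test labels does not leak into the features.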

Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

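    Nominal encoding is a type of data encoding for nominal variables, that is, categorical variables whose categories have no inherent ordering or ranking. In the form used here, it assigns a unique integer to each category; the integers serve only as identifiers and do not imply any order.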

    An example of nominal encoding in a real-world scenario is a classification problem in which a categorical variable represents different types of fruit, such as "apple", "banana", and "orange". Nominal encoding converts these categories into numerical values, which can then be fed into a machine learning algorithm for classification.
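
    The fruit example can be sketched in a few lines of plain Python. The sample list and the choice of alphabetical order for assigning integers are assumptions made for illustration; any fixed, consistent mapping would serve equally well.

```python
# Hypothetical sample of the categorical variable.
fruits = ["banana", "apple", "orange", "apple", "banana"]

# Build a fixed mapping from category to integer identifier.
categories = sorted(set(fruits))
fruit_to_code = {fruit: code for code, fruit in enumerate(categories)}

# Encode the column, then decode it again to show the mapping is reversible.
encoded = [fruit_to_code[fruit] for fruit in fruits]
code_to_fruit = {code: fruit for fruit, code in fruit_to_code.items()}
decoded = [code_to_fruit[code] for code in encoded]

print(fruit_to_code)  # {'apple': 0, 'banana': 1, 'orange': 2}
print(encoded)        # [1, 0, 2, 0, 1]
```

    The integers are labels only: the fact that "orange" receives 2 and "apple" receives 0 says nothing about the fruits themselves.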

Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

    Nominal (integer) encoding is preferred over one-hot encoding when the number of categories in a categorical variable is large and the model can tolerate the arbitrary order of the integer codes, for example a tree-based model, or when the variable is the target label rather than an input feature. One-hot encoding creates a binary vector for each category, with a value of 1 for the corresponding category and 0 for all other categories. With many categories this leads to a high number of features and a sparse matrix, which increases memory use and training cost and can make the results harder to interpret.

    An example of when nominal encoding is preferred over one-hot encoding is in natural language processing, specifically in the case of text classification. In this scenario, we might have a categorical variable representing the topic or category of a given document, such as "politics", "sports", "entertainment", "technology", and so on. If the list of topics is long, one-hot encoding produces one column per topic, which can be computationally expensive to process. Representing each topic as a unique integer keeps the variable in a single column, which is more compact, and most classification libraries accept integer class labels directly when the topic is the prediction target. Note that neither scheme, in its basic form, handles a document that belongs to several topics at once; that case calls for a multi-label representation with one binary indicator per topic.
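
    The difference in width can be made concrete with a small sketch. The 500 topic names below are generated placeholders, not real labels, and the helper functions are hypothetical names introduced only for this illustration.

```python
# Hypothetical vocabulary of 500 topics.
topics = [f"topic_{i}" for i in range(500)]
topic_to_code = {topic: code for code, topic in enumerate(topics)}


def integer_encode(topic):
    """Represent a topic as a single integer identifier."""
    return topic_to_code[topic]


def one_hot_encode(topic):
    """Represent a topic as a binary vector with one position per topic."""
    row = [0] * len(topics)
    row[topic_to_code[topic]] = 1
    return row


doc_topic = "topic_42"
as_integer = integer_encode(doc_topic)
as_one_hot = one_hot_encode(doc_topic)

print(as_integer)       # one value per document
print(len(as_one_hot))  # one column per topic
```

    With integer encoding each document contributes one value; with one-hot encoding it contributes as many values as there are topics, all but one of them zero.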

Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice.

    Assuming there is no natural ordering or ranking between the categories and the number of categories is relatively small (5 unique values), "nominal encoding" can be a suitable choice. Nominal encoding assigns a unique integer to each category, which represents the categorical data numerically in a single compact column. Because the mapping is one-to-one, the frequency distribution of the categories is unchanged. One caveat is that linear and distance-based models may read the integers as magnitudes; for such models, one-hot encoding is a safe alternative here, since 5 categories add only 5 binary columns.
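
    A point worth checking for this choice is the numeric spacing that integer codes introduce. The sketch below uses five hypothetical colour names and compares the gap between integer codes with the squared distance between the corresponding one-hot vectors; the helper names are assumptions made for illustration.

```python
# Five hypothetical categories with no natural order.
categories = ["red", "green", "blue", "yellow", "black"]
code = {name: i for i, name in enumerate(categories)}


def one_hot(name):
    """Binary vector with a 1 in the position of the given category."""
    return [1 if name == other else 0 for other in categories]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Integer codes: the gap depends on the arbitrary order of the list.
int_gap_near = abs(code["red"] - code["green"])
int_gap_far = abs(code["red"] - code["black"])

# One-hot vectors: every pair of distinct categories is equally far apart.
oh_gap_near = squared_distance(one_hot("red"), one_hot("green"))
oh_gap_far = squared_distance(one_hot("red"), one_hot("black"))

print(int_gap_near, int_gap_far)  # 1 4
print(oh_gap_near, oh_gap_far)    # 2 2
```

    Models that rely on distances or linear weights see "red" as four times further from "black" than from "green" under integer codes, which is an artefact of the numbering rather than a property of the data.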

Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
transform the categorical data, how many new columns would be created? Show your calculations.

    In this question, nominal encoding is taken to mean one-hot encoding, which converts each unique value in a categorical column into a new binary column, with a value of 1 indicating that the row has that category, and 0 indicating that it does not.

    Plain one-hot encoding creates one binary column per unique category. The calculation here uses the common variant that drops one column per feature (often called dummy encoding), so each categorical column yields as many new binary columns as it has unique categories, minus one, since the dropped category can be inferred from the absence of all the others.

    The question does not state how many unique categories each column has, so assume that the first categorical column has 4 unique categories and the second has 6. The number of new columns is then:

    First categorical column: 4 unique categories - 1 = 3 new columns
    Second categorical column: 6 unique categories - 1 = 5 new columns
    Total: 3 + 5 = 8 new columns

    Therefore, after encoding, the two original categorical columns are replaced by 8 binary columns, and the dataset has 1000 rows and 8 + 3 = 11 columns. (If no category were dropped, the counts would be 4 + 6 = 10 new columns and 10 + 3 = 13 columns in total.)
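
    The arithmetic above can be checked with pandas. The sketch below builds a synthetic 1000-row DataFrame under the same assumption of 4 and 6 unique categories; the column names and values are invented for illustration, and `pd.get_dummies` with `drop_first=True` stands in for the "minus one" variant.

```python
import pandas as pd

n_rows = 1000
levels_a = ["a1", "a2", "a3", "a4"]              # 4 assumed categories
levels_b = ["b1", "b2", "b3", "b4", "b5", "b6"]  # 6 assumed categories

# Three numerical columns and two categorical columns (synthetic values).
df = pd.DataFrame({
    "num_1": range(n_rows),
    "num_2": [i * 0.5 for i in range(n_rows)],
    "num_3": [i % 7 for i in range(n_rows)],
    "cat_a": [levels_a[i % 4] for i in range(n_rows)],
    "cat_b": [levels_b[i % 6] for i in range(n_rows)],
})

# Drop one column per feature: (4 - 1) + (6 - 1) = 8 new columns.
dropped = pd.get_dummies(df, columns=["cat_a", "cat_b"], drop_first=True)

# Keep every category: 4 + 6 = 10 new columns.
full = pd.get_dummies(df, columns=["cat_a", "cat_b"])

print(dropped.shape)  # (1000, 11)
print(full.shape)     # (1000, 13)
```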

Q6. You are working with a dataset containing information about different types of animals, including their
species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
a format suitable for machine learning algorithms? Justify your answer.

    Since there is no inherent order in the categories of the features, we use one-hot encoding. One-hot encoding will create binary vectors for each feature, which can be used as input to machine learning algorithms.

    One-hot encoding is a popular technique for handling categorical data in machine learning. This technique creates a binary vector for each categorical feature, where each vector has a length equal to the number of categories in the feature. The vector has a value of 1 in the position corresponding to the category and 0 in all other positions.
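
    The binary-vector layout described above can be sketched in plain Python. The three animals and their attribute values are hypothetical sample rows, and the `feature_category` naming of the output columns is an assumption chosen for readability.

```python
# Hypothetical sample rows.
animals = [
    {"species": "lion", "habitat": "savanna", "diet": "carnivore"},
    {"species": "zebra", "habitat": "savanna", "diet": "herbivore"},
    {"species": "bear", "habitat": "forest", "diet": "omnivore"},
]
features = ["species", "habitat", "diet"]

# Collect the sorted unique categories of each feature.
categories = {f: sorted({a[f] for a in animals}) for f in features}

# One output column per (feature, category) pair.
column_names = [f"{f}_{c}" for f in features for c in categories[f]]


def one_hot_row(animal):
    """Concatenate one binary vector per feature into a single row."""
    row = []
    for f in features:
        row.extend(1 if animal[f] == c else 0 for c in categories[f])
    return row


encoded = [one_hot_row(a) for a in animals]
print(column_names)
for row in encoded:
    print(row)
```

    Each row contains exactly one 1 per original feature, so every row here sums to 3.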

Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

    The categorical features gender and contract type have no inherent order, so we use one-hot encoding to transform them into numerical data. Age, monthly charges, and tenure are already numerical and need no encoding.

    Steps to implement encoding:
    1. Import OneHotEncoder from the scikit-learn library, and import the pandas library.
        from sklearn.preprocessing import OneHotEncoder
        import pandas as pd

    2. Load the dataset into a pandas DataFrame.
        df = pd.read_csv('telecom_churn.csv')

    3. Identify the categorical columns that need to be one-hot encoded.
        cat_cols = ['gender', 'contract_type']

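    4. Create an instance of OneHotEncoder.
        encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        handle_unknown='ignore' encodes categories that were not seen during fitting (for example, new categories in the test set) as all zeros instead of raising an error, and sparse_output=False returns a dense array. In scikit-learn versions before 1.2 this parameter is named sparse.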

    5. Fit the encoder on the categorical columns.
        encoder.fit(df[cat_cols])

    6. Transform the categorical columns into one-hot encoded columns, keeping the row index of df so the rows stay aligned.
        encoded_cols = pd.DataFrame(encoder.transform(df[cat_cols]), index=df.index)

    7. Name the one-hot encoded columns for clarity (get_feature_names_out replaces the removed get_feature_names).
        encoded_cols.columns = encoder.get_feature_names_out(cat_cols)

    8. Replace the original categorical columns with the one-hot encoded columns.
        df = pd.concat([df.drop(columns=cat_cols), encoded_cols], axis=1)
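
    The effect of handle_unknown='ignore' from step 4 can be checked on a toy example. The sketch below uses small hypothetical in-memory DataFrames instead of telecom_churn.csv, and the category values are invented. It keeps the encoder's default sparse output and calls .toarray(), so it does not depend on the version-specific name of the dense-output parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows for the two categorical columns.
train = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "contract_type": ["Monthly", "Yearly", "Monthly"],
})

# Hypothetical test row with a contract type that is absent from train.
test = pd.DataFrame({
    "gender": ["Male"],
    "contract_type": ["Two-year"],
})

# Fit on the training data only, then reuse the fitted encoder on test data.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned categories per column, in sorted order.
learned = [list(cats) for cats in encoder.categories_]

# The unseen contract type is encoded as zeros in all contract_type columns.
encoded_test = encoder.transform(test).toarray()

print(learned)
print(encoded_test)
```

    Fitting on the training data and only transforming the test data keeps the column layout identical between the two sets, which is what the model expects at prediction time.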