# Task
Analyze the provided dataset "/content/sorting hat - Sheet1 (5).csv" to build an AI system that sorts individuals into Hogwarts houses based on their personality, choices, and responses.

## Load the data

### Subtask:
Load the dataset into a pandas DataFrame.


**Reasoning**:
Import the pandas library and load the dataset into a pandas DataFrame. Then, display the first 5 rows and the info of the DataFrame to understand its structure and data types.



In [None]:
import pandas as pd

df = pd.read_csv('/content/sorting hat - Sheet1 (5).csv')
display(df.head())
display(df.info())  # df.info() prints its report and returns None, which is why "None" is displayed after it

Unnamed: 0,HOUSE,WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF
0,hufflepuff,I like to ride bikes
1,gryffindor,I am honest and brave
2,ravenclaw,I study hard and am very diligent in my work
3,gryffindor,I faced my fears today
4,slytherin,"My sister cried when I called her stupid, it's..."


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515 entries, 0 to 514
Data columns (total 2 columns):
 #   Column                                          Non-Null Count  Dtype 
---  ------                                          --------------  ----- 
 0   HOUSE                                           515 non-null    object
 1   WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF  515 non-null    object
dtypes: object(2)
memory usage: 8.2+ KB


None

## Explore and preprocess the data

### Subtask:
Analyze the data to understand its structure, identify relevant columns, and handle any missing values or inconsistencies.


**Reasoning**:
Examine the unique values in the 'HOUSE' column, check for inconsistencies, analyze the text column for patterns and cleaning needs, and determine if there are any missing values.



In [None]:
display(df['HOUSE'].unique())
display(df['HOUSE'].value_counts())

# Analyze the text data - just a quick look for now
display(df['WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF'].head())

# Check for missing values
display(df.isnull().sum())

array(['hufflepuff', 'gryffindor', 'ravenclaw', 'slytherin'], dtype=object)

HOUSE,count
gryffindor,148
slytherin,145
hufflepuff,114
ravenclaw,108


Unnamed: 0,WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF
0,I like to ride bikes
1,I am honest and brave
2,I study hard and am very diligent in my work
3,I faced my fears today
4,"My sister cried when I called her stupid, it's..."


Unnamed: 0,0
HOUSE,0
WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF,0


**Reasoning**:
The 'HOUSE' column holds exactly four consistently spelled, lowercase labels, with a mild class imbalance (148 gryffindor, 145 slytherin, 114 hufflepuff, 108 ravenclaw). The text column contains free-form sentences that need standard text cleaning. There are no missing values in either column. The next step is to outline the preprocessing steps for the text data.



In [None]:
# Outline necessary preprocessing steps for the text data
print("Preprocessing steps for text data ('WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF'):")
print("1. Convert text to lowercase.")
print("2. Remove punctuation.")
print("3. Consider removing common English stop words (optional, depending on model).")
print("4. Tokenization (splitting text into words).")
print("5. Lemmatization or Stemming (reducing words to their base or root form) (optional).")

Preprocessing steps for text data ('WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF'):
1. Convert text to lowercase.
2. Remove punctuation.
3. Consider removing common English stop words (optional, depending on model).
4. Tokenization (splitting text into words).
5. Lemmatization or Stemming (reducing words to their base or root form) (optional).
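The first four steps of the outline can be sketched with the standard library alone. This is a minimal illustration, not the preprocessing the notebook actually runs: the `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names introduced here, and a real project would use a fuller stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"i", "am", "and", "to", "the", "a", "in", "my", "is", "it"}


def preprocess(text, remove_stop_words=True):
    """Lowercase, strip punctuation, tokenize on whitespace, optionally drop stop words."""
    # Step 1: convert to lowercase
    text = text.lower()
    # Step 2: remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Step 4: tokenize by splitting on whitespace
    tokens = text.split()
    # Step 3 (optional): drop stop words
    if remove_stop_words:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens


print(preprocess("I am honest and brave"))
print(preprocess("I faced my fears today!", remove_stop_words=False))
```

Lemmatization or stemming (step 5) is left out because it needs an external library or a language-specific rule set.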


## Feature engineering

### Subtask:
Extract relevant features from the text data using techniques like TF-IDF or word embeddings.


**Reasoning**:
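Extract features from the text column using TF-IDF. `TfidfVectorizer` lowercases the text, its default tokenizer ignores punctuation, and `stop_words='english'` removes common English stop words, so steps 1 to 4 of the outline above are handled by the vectorizer itself. No lemmatization or stemming is applied.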



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
X = vectorizer.fit_transform(df['WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF'])

display(X.shape)

(515, 810)

## Choose and train a model

### Subtask:
Select a suitable machine learning model for classification (e.g., a sentiment analysis model or a personality prediction model) and train it on the prepared data.


**Reasoning**:
Split the data into training and testing sets and then train a Logistic Regression model on the training data.



In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, df['HOUSE'], test_size=0.2, random_state=42)

# Instantiate and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

print("Model training complete.")

Model training complete.
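One caveat with the cells above: the vectorizer was fitted on all 515 rows before the split, so the vocabulary and IDF weights already reflect the test rows. A possible alternative, sketched here on hypothetical toy data (`toy_texts` and `toy_labels` stand in for the text column and `df['HOUSE']`), is to split the raw text first and wrap the vectorizer and classifier in a single scikit-learn pipeline. The `stratify` argument is also an addition; it keeps the house proportions equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: four descriptions per house.
toy_texts = [
    "I am brave and face my fears", "I stand up for my friends with courage",
    "I love daring adventures", "I never back down from a challenge",
    "I study hard and love books", "I enjoy solving puzzles and learning",
    "I am curious about science", "I read every night to gain knowledge",
    "I am loyal and kind to everyone", "I work patiently and help others",
    "I like baking for my friends", "I value fairness and hard work",
    "I am ambitious and cunning", "I always plan to win",
    "I want power and success", "I use every advantage to reach my goals",
]
toy_labels = (["gryffindor"] * 4 + ["ravenclaw"] * 4
              + ["hufflepuff"] * 4 + ["slytherin"] * 4)

# Split the raw text BEFORE any fitting, keeping class proportions equal.
text_train, text_test, label_train, label_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(text_train, label_train)

# Raw strings go straight in; the pipeline vectorizes them internally.
predictions = pipeline.predict(text_test)
print(list(predictions))
```

With a pipeline, prediction code no longer needs a separate `vectorizer.transform` call, and the test rows never influence the learned vocabulary.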


## Evaluate the model

### Subtask:
Assess the performance of the trained model using appropriate evaluation metrics.


**Reasoning**:
Import the necessary evaluation metrics and calculate the accuracy, precision, recall, and F1-score using the trained model and the test data.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.8544
Precision: 0.8617
Recall: 0.8544
F1-score: 0.8535
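In the output above, Recall is identical to Accuracy. That is expected: with `average='weighted'`, each class's recall is weighted by its share of the true labels, and the weighted sum reduces to the overall fraction of correct predictions. The following pure-Python sketch recomputes the four metrics on a small made-up example to show this; `weighted_metrics` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1), the last three support-weighted."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        # Per-class scores; a class that is never predicted gets precision 0
        class_precision = true_pos / predicted if predicted else 0.0
        class_recall = true_pos / count
        denom = class_precision + class_recall
        class_f1 = 2 * class_precision * class_recall / denom if denom else 0.0
        weight = count / total
        precision += weight * class_precision
        recall += weight * class_recall
        f1 += weight * class_f1
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return accuracy, precision, recall, f1


# Made-up labels for illustration only
true_houses = ["gryffindor", "gryffindor", "slytherin", "slytherin", "hufflepuff", "ravenclaw"]
pred_houses = ["gryffindor", "slytherin", "slytherin", "slytherin", "hufflepuff", "hufflepuff"]

accuracy, precision, recall, f1 = weighted_metrics(true_houses, pred_houses)
print(f"Accuracy: {accuracy:.4f}  Recall: {recall:.4f}")
```

Because the two numbers always coincide, weighted precision and F1 (or macro-averaged scores) carry the additional information about per-class behavior.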


## Implement the sorting system

### Subtask:
Create a function or class that takes new input (personality, choices, responses) and uses the trained model to predict the most suitable Hogwarts house.


**Reasoning**:
Define a function that takes a string input, transforms it using the pre-trained vectorizer, and predicts the house using the pre-trained model.



In [None]:
def predict_house(description):
  """
  Predicts the Hogwarts house based on a personality description.

  Args:
    description: A string containing the personality description.

  Returns:
    The predicted Hogwarts house as a string.
  """
  # Transform the input description using the trained vectorizer
  description_vector = vectorizer.transform([description])

  # Predict the house using the trained model
  predicted_house = model.predict(description_vector)

  return predicted_house[0]

# Example usage (optional, for testing)
# test_description = "I value bravery and courage above all else."
# predicted = predict_house(test_description)
# print(f"The predicted house is: {predicted}")
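`predict_house` returns only the single most likely label. Since `LogisticRegression` also exposes `predict_proba`, a variant can report how confident the model is in each house. The sketch below is one possible shape for such a helper: `predict_house_with_confidence` is a hypothetical name, it takes the vectorizer and model as arguments instead of reading notebook globals, and it is demonstrated on a tiny made-up training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def predict_house_with_confidence(description, vectorizer, model):
    """Return (house, probability) pairs sorted from most to least likely."""
    vector = vectorizer.transform([description])
    probabilities = model.predict_proba(vector)[0]
    ranked = sorted(
        zip(model.classes_, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Convert NumPy scalars to plain Python types for easy printing
    return [(str(house), float(prob)) for house, prob in ranked]


# Hypothetical toy data standing in for the real dataset
toy_texts = [
    "I am brave and face my fears", "I stand up for my friends with courage",
    "I study hard and love books", "I enjoy solving puzzles and learning",
    "I am loyal and kind to everyone", "I work patiently and help others",
    "I am ambitious and cunning", "I always plan to win",
]
toy_labels = ["gryffindor", "gryffindor", "ravenclaw", "ravenclaw",
              "hufflepuff", "hufflepuff", "slytherin", "slytherin"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

ranking = predict_house_with_confidence("I am brave", toy_vectorizer, toy_model)
for house, prob in ranking:
    print(f"{house}: {prob:.2f}")
```

Showing the full ranking makes close calls visible, which a single label hides.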

## Summary:

### Data Analysis Key Findings

*   The dataset contains 515 entries with two columns: 'HOUSE' and 'WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF'.
*   There are no missing values in the dataset.
*   The 'HOUSE' column contains four unique values: 'hufflepuff', 'gryffindor', 'ravenclaw', and 'slytherin'.
*   The text data in the 'WRITE A SENTENCE OR DESCRIPTION ABOUT YOURSELF' column was transformed into a numerical representation using TF-IDF, resulting in 810 unique features.
*   A Logistic Regression model was trained on 80% of the data (412 rows) and achieved an accuracy of approximately 85.44% on the remaining 20% (103 rows).
*   The weighted precision, recall, and F1-score of the model were approximately 86.17%, 85.44%, and 85.35%, respectively. Weighted recall always equals accuracy, which is why those two figures match.

### Insights or Next Steps

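*   The trained model reaches about 85% accuracy across four classes, well above the roughly 29% that always predicting the largest house (148 of 515 rows) would be expected to achieve.
*   The score comes from a single 80/20 split with only 103 test rows, so cross-validation would give a more stable estimate. Further gains could be explored by trying different classification algorithms or refining the text preprocessing and feature engineering steps.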
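Trying different classification algorithms could look roughly like the sketch below, which scores three common text classifiers with cross-validation. Everything here is illustrative: the toy data is made up, the choice of `MultinomialNB` and `LinearSVC` as alternatives is an assumption rather than something the notebook evaluated, and `cv=2` is used only because the toy set is tiny (5 or 10 folds would be more usual on the real 515 rows).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: four descriptions per house.
toy_texts = [
    "I am brave and face my fears", "I stand up for my friends with courage",
    "I love daring adventures", "I never back down from a challenge",
    "I study hard and love books", "I enjoy solving puzzles and learning",
    "I am curious about science", "I read every night to gain knowledge",
    "I am loyal and kind to everyone", "I work patiently and help others",
    "I like baking for my friends", "I value fairness and hard work",
    "I am ambitious and cunning", "I always plan to win",
    "I want power and success", "I use every advantage to reach my goals",
]
toy_labels = (["gryffindor"] * 4 + ["ravenclaw"] * 4
              + ["hufflepuff"] * 4 + ["slytherin"] * 4)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svc": LinearSVC(),
}

results = {}
for name, classifier in candidates.items():
    # Vectorizer lives inside the pipeline so each fold fits its own vocabulary
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), classifier)
    scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=2, scoring="accuracy")
    results[name] = scores.mean()

for name, mean_accuracy in results.items():
    print(f"{name}: mean accuracy {mean_accuracy:.3f}")
```

Scores on sixteen toy sentences say nothing about the real dataset; the point is only the comparison loop.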


# Task
Create a user input chatbot with a quiz based on the provided dataset "/content/sorting hat - Sheet1 (5).csv" to predict a Hogwarts house.

## Create a function to ask quiz questions

### Subtask:
Define a function that presents questions to the user and collects their responses.


**Reasoning**:
Define the `ask_quiz_questions` function as described in the instructions to collect user responses to a list of questions.



In [None]:
def ask_quiz_questions(questions):
  """
  Presents questions to the user and collects their responses.

  Args:
    questions: A list of strings, where each string is a question.

  Returns:
    A list of strings, where each string is the user's response to a question.
  """
  responses = []
  for question in questions:
    response = input(question + "\n")
    responses.append(response)
  return responses

## Process user responses

### Subtask:
Process the collected user responses to create a single text input for the prediction model.


**Reasoning**:
Define the `process_responses` function as instructed to concatenate the list of user responses into a single string.



In [None]:
def process_responses(responses):
  """
  Processes a list of user responses into a single text input for prediction.

  Args:
    responses: A list of strings, where each string is a user response.

  Returns:
    A single string containing all responses concatenated with spaces.
  """
  return " ".join(responses)

## Integrate with the prediction model

### Subtask:
Use the `predict_house` function (already defined) to predict the house based on the processed user input.


**Reasoning**:
Use the defined `predict_house` function with the processed user input to get the house prediction.



In [None]:
# Assuming 'processed_response' is the variable holding the processed user input string
# For demonstration purposes, let's create a dummy processed_response
processed_response = "I am brave and loyal"

predicted_house = predict_house(processed_response)
print(f"Based on your responses, your predicted house is: {predicted_house}")

Based on your responses, your predicted house is: gryffindor


## Create the chatbot interface

### Subtask:
Implement a loop that guides the user through the quiz and displays the predicted house.


**Reasoning**:
Implement a loop to guide the user through the quiz, collect responses, process them, predict the house, and display the result, with an option to exit.



In [None]:
def run_quiz():
  """Runs the interactive Hogwarts sorting quiz."""
  questions = [
      "Describe a situation where you had to make a difficult choice.",
      "What quality do you value most in yourself?",
      "How do you react under pressure?",
      "What is your greatest ambition?",
      "If you found a lost item, what would you do?"
  ]

  print("Welcome to the Hogwarts Sorting Quiz!")

  while True:
    user_responses = ask_quiz_questions(questions)
    processed_responses = process_responses(user_responses)
    predicted_house = predict_house(processed_responses)

    print(f"\nBased on your responses, your predicted house is: {predicted_house}")

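    play_again = input("Would you like to take the quiz again? (yes/no): ").strip().lower()
    # Accept "yes" or "y" (ignoring surrounding whitespace); anything else ends the quiz
    if play_again not in ('yes', 'y'):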
      print("Thank you for taking the quiz!")
      break

run_quiz()

Welcome to the Hogwarts Sorting Quiz!
Describe a situation where you had to make a difficult choice.
I attended college during a mass bunk
What quality do you value most in yourself?
Standing up against wrong actions
How do you react under pressure?
I keep myself calm
What is your greatest ambition?
Become a great human being
If you found a lost item, what would you do?
Keep it with myself till someone asks for it

Based on your responses, your predicted house is: hufflepuff
Would you like to take the quiz again? (yes/no): no
Thank you for taking the quiz!


## Summary:

### Data Analysis Key Findings

*   The `ask_quiz_questions` function successfully prompts the user for input based on a list of questions and stores the responses.
*   The `process_responses` function effectively concatenates a list of user responses into a single string separated by spaces, suitable for model input.
*   A pre-defined `predict_house` function is utilized to predict a Hogwarts house based on the processed user input string.
*   A loop is implemented to guide the user through the quiz, presenting questions, processing responses, predicting the house, and offering to repeat the quiz.

### Insights or Next Steps

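*   `run_quiz` relies on the `predict_house` function defined in the first task, which in turn reads the fitted `vectorizer` and `model` as notebook globals. The earlier cells must therefore be run before the quiz. Saving the fitted objects, or passing them in as arguments, would make the chatbot usable outside this notebook session.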
*   Consider adding input validation within `ask_quiz_questions` to handle unexpected user inputs gracefully and potentially guide the user towards providing more suitable responses.
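The input validation suggested above might be sketched as follows. This is one possible approach, not the notebook's implementation: `ask_quiz_questions_validated`, the `min_words` threshold, and the injectable `input_fn` parameter are all hypothetical additions. Injecting the input function lets the loop be exercised with scripted answers instead of a live prompt.

```python
def ask_quiz_questions_validated(questions, min_words=2, input_fn=input):
    """Ask each question, re-prompting until the answer has at least `min_words` words."""
    responses = []
    for question in questions:
        while True:
            response = input_fn(question + "\n").strip()
            if len(response.split()) >= min_words:
                responses.append(response)
                break
            # Empty or very short answers give the model almost nothing to work with
            print(f"Please answer with at least {min_words} words.")
    return responses


# Scripted answers stand in for a user: the first two are rejected, the third accepted
scripted = iter(["", "calm", "I keep myself calm"])
demo = ask_quiz_questions_validated(
    ["How do you react under pressure?"],
    input_fn=lambda prompt: next(scripted),
)
print(demo)
```

In the notebook, calling the function without `input_fn` falls back to the built-in `input`, so it could replace `ask_quiz_questions` inside `run_quiz` without other changes.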
