<a href="https://colab.research.google.com/github/MaInthiyaz/OasisInfobite_Data-analytics/blob/main/Autocomplete_and_Autocorrect_Data_Analytics_()P9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Idea: **Autocomplete and Autocorrect Data Analytics**



**Description:**



This data analytics project examines the efficiency and accuracy of autocomplete and autocorrect
algorithms in natural language processing (NLP). The objective is to improve text prediction, and
with it the user experience, by analyzing large text datasets and implementing or optimizing
autocomplete and autocorrect functionality.


**Key Concepts and Challenges:**

- Dataset Collection: Gather diverse text data.
- NLP Preprocessing: Clean and prepare data for analysis.
- Autocomplete: Implement algorithms for word/phrase predictions.
- Autocorrect: Optimize algorithms for spelling error correction.
- Metrics: Define and measure performance metrics.
- User Experience: Assess impact through feedback and surveys.
- Algorithm Comparison: Evaluate different models for efficiency and accuracy.
- Visualization: Use tools for data visualization.
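
The "Metrics" item above can be made concrete with two simple ratios: a top-k hit rate for autocomplete and an exact-match accuracy for autocorrect. The following is a minimal sketch under stated assumptions, not the notebook's own method. The function names `top_k_hit_rate` and `correction_accuracy`, the toy vocabulary, the lookup-table corrector, and all test cases are hypothetical and exist only for illustration.

```python
from collections import Counter


def top_k_hit_rate(suggest, cases, k=5):
    """Fraction of (prefix, target) cases whose target is in the first k suggestions."""
    if not cases:
        return 0.0
    hits = sum(1 for prefix, target in cases if target in suggest(prefix)[:k])
    return hits / len(cases)


def correction_accuracy(correct, cases):
    """Fraction of (misspelled, expected) cases where correct(misspelled) == expected."""
    if not cases:
        return 0.0
    hits = sum(1 for wrong, expected in cases if correct(wrong) == expected)
    return hits / len(cases)


# Hypothetical toy vocabulary standing in for real word counts.
word_counts = Counter({"fraud": 5, "payment": 4, "free": 3, "frame": 2})


def suggest(prefix):
    """Frequency-ranked prefix matches (ties broken alphabetically)."""
    matches = [w for w in word_counts if w.startswith(prefix)]
    return sorted(matches, key=lambda w: (-word_counts[w], w))


# Hypothetical corrector: a fixed lookup table, just to exercise the metric.
known_fixes = {"fraudlent": "fraudulent"}


def correct(word):
    return known_fixes.get(word, word)


autocomplete_cases = [("fra", "fraud"), ("pay", "payment"), ("fr", "frame"), ("x", "xylophone")]
autocorrect_cases = [("fraudlent", "fraudulent"), ("paymnt", "payment")]

hit_rate_top1 = top_k_hit_rate(suggest, autocomplete_cases, k=1)
hit_rate_top3 = top_k_hit_rate(suggest, autocomplete_cases, k=3)
accuracy = correction_accuracy(correct, autocorrect_cases)
print("Top-1 hit rate:", hit_rate_top1)
print("Top-3 hit rate:", hit_rate_top3)
print("Autocorrect accuracy:", accuracy)
```

Passing the suggester and corrector in as functions keeps the metrics independent of any one algorithm, so the same test cases can be reused when comparing models.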


In [4]:
import pandas as pd  # Data manipulation
import numpy as np  # Numerical operations
import re  # Regular expressions for text cleaning
from collections import Counter  # Counting word frequency
from textblob import TextBlob  # Simple autocorrect and NLP tools
import matplotlib.pyplot as plt  # Visualization
import seaborn as sns  # Enhanced plots

# ----------------------
# Step 1: Dataset Collection
# ----------------------
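# Load the credit card dataset. The text steps below run only if it has a
# 'Description' column; otherwise the script reports that none was found.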
data = pd.read_csv("creditcard.csv")  # Load dataset
print("Dataset Shape:", data.shape)  # Print number of rows and columns

# ----------------------
# Step 2: NLP Preprocessing (if applicable)
# ----------------------
# Check if the dataset contains a text field for autocomplete/autocorrect analysis.
if 'Description' in data.columns:

    def clean_text(text):
        """Lowercase the text and keep only letters and whitespace."""
        if pd.isna(text):  # Missing values become empty, not the string "nan"
            return ""
        text = str(text).lower()  # Convert to lowercase
        text = re.sub(r'[^a-z\s]', '', text)  # Remove punctuation and digits
        return text

    # Apply text cleaning to the 'Description' column.
    data['clean_description'] = data['Description'].apply(clean_text)

    # Tokenize cleaned text into words.
    tokens = []
    for sentence in data['clean_description'].dropna():
        tokens.extend(sentence.split())

    # ----------------------
    # Step 3: Word Frequency for Autocomplete
    # ----------------------
    word_counts = Counter(tokens)  # Count word frequency

    def autocomplete(prefix):
        """Suggest up to five of the most frequent words starting with the given prefix."""
        prefix = prefix.lower()  # Tokens were lowercased during cleaning
        suggestions = [word for word in word_counts if word.startswith(prefix)]
        return sorted(suggestions, key=lambda x: word_counts[x], reverse=True)[:5]

    print("Autocomplete suggestions for 'fra':", autocomplete("fra"))  # Example

    # ----------------------
    # Step 4: Autocorrect with TextBlob
    # ----------------------
    def autocorrect(word):
        """Returns the most probable correction for a given word."""
        blob = TextBlob(word)
        return str(blob.correct())

    print("Autocorrect suggestion for 'fraudlent':", autocorrect("fraudlent"))

    # ----------------------
    # Step 5: Metrics and Visualization
    # ----------------------
    # Visualize the top 10 most common words (skipped if no tokens were found).
    most_common_words = word_counts.most_common(10)
    if most_common_words:  # zip(*[]) cannot be unpacked into two names
        words, counts = zip(*most_common_words)

        plt.figure(figsize=(10, 6))
        sns.barplot(x=list(counts), y=list(words), palette="viridis")
        plt.title("Top 10 Most Common Words in Transaction Descriptions")
        plt.xlabel("Count")
        plt.ylabel("Words")
        plt.show()

    # ----------------------
    # Step 6: User Experience Simulation
    # ----------------------
    # Placeholder scores only: random integers from 7 to 9 (randint's upper
    # bound is exclusive), not real user feedback.
    user_feedback = np.random.randint(7, 10, size=10)
    print("User Satisfaction Scores:", user_feedback)
    print("Average User Score:", np.mean(user_feedback))

    # ----------------------
    # Step 7: Algorithm Comparison Notes
    # ----------------------
    print("Consider testing SymSpell, Levenshtein Distance, or transformer-based models for advanced autocomplete/autocorrect.")

else:
    print("No suitable text-based column (e.g., 'Description') found in the dataset for autocomplete/autocorrect analysis.")


Dataset Shape: (284807, 31)
No suitable text-based column (e.g., 'Description') found in the dataset for autocomplete/autocorrect analysis.
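
As the output shows, `creditcard.csv` has no 'Description' column, so none of the text steps ran. The sketch below exercises the same ideas on a tiny inline corpus so that the autocomplete and autocorrect logic can be seen working. It is an illustration under stated assumptions, not the notebook's method: the corpus is made up, and the autocorrect step uses plain Levenshtein distance (one of the alternatives named in Step 7) in place of TextBlob. The `levenshtein` helper and the `max_distance` parameter are hypothetical names introduced here.

```python
import re
from collections import Counter

# Hypothetical stand-in corpus; the notebook's dataset has no text column.
corpus = [
    "Fraudulent payment flagged by bank",
    "Payment received from customer",
    "Fraud alert: payment declined",
]

# Same cleaning idea as the notebook: lowercase, keep letters and whitespace.
tokens = []
for sentence in corpus:
    tokens.extend(re.sub(r"[^a-z\s]", "", sentence.lower()).split())
word_counts = Counter(tokens)


def autocomplete(prefix, k=5):
    """Up to k vocabulary words with the prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [w for w in word_counts if w.startswith(prefix)]
    # Break frequency ties alphabetically so the result is deterministic.
    return sorted(matches, key=lambda w: (-word_counts[w], w))[:k]


def levenshtein(a, b):
    """Edit distance using the standard row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def autocorrect(word, max_distance=2):
    """Closest vocabulary word within max_distance, else the input unchanged."""
    word = word.lower()
    if word in word_counts:
        return word
    # Prefer the smallest distance, then the more frequent word.
    best = min(word_counts, key=lambda w: (levenshtein(word, w), -word_counts[w]))
    return best if levenshtein(word, best) <= max_distance else word


print("Autocomplete suggestions for 'fra':", autocomplete("fra"))
print("Autocorrect suggestion for 'fraudlent':", autocorrect("fraudlent"))
print("Autocorrect suggestion for 'paymnt':", autocorrect("paymnt"))
```

Scanning the whole vocabulary for every lookup is fine at this size, but it grows linearly with the number of distinct words, which is one reason to compare it against indexed approaches such as SymSpell on a real text dataset.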
