This project demonstrates how to build a Language Detection Model using Naive Bayes and a Bag-of-Words (CountVectorizer) approach.
It trains a machine learning model to classify text into different languages and saves the trained model as a pipeline for reuse.
- Loads and preprocesses a dataset of text samples with their corresponding languages.
- Cleans text (removes numbers, special characters, converts to lowercase).
- Converts text into numeric vectors using CountVectorizer (Bag of Words).
- Trains a Naive Bayes classifier for language detection.
- Evaluates performance using Accuracy and F1 Score.
- Creates a Pipeline for streamlined training and prediction.
- Saves the trained model using Pickle for later use.
- Predicts the language of new unseen text.
```
.
├── Language Detection.csv        # Dataset file
├── language_detection.py         # Main script (your code)
├── trained_pipeline-0.1.0.pkl    # Saved trained pipeline
└── README.md                     # Documentation
```
- Clone this repository:

  ```shell
  git clone https://github.com/your-username/language-detection.git
  cd language-detection
  ```
- Install the required Python libraries:

  ```shell
  pip install pandas numpy scikit-learn seaborn matplotlib
  ```
- Load Dataset: load `Language Detection.csv`, containing text samples and their corresponding language labels.
- Preprocessing:
  - Remove special characters, numbers, and extra symbols.
  - Convert text to lowercase.
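A minimal sketch of the cleaning step (the exact regex used in `language_detection.py` may differ):

```python
import re

def clean_text(text: str) -> str:
    """Remove digits and punctuation, collapse whitespace, lowercase."""
    text = re.sub(r"\d+", " ", text)      # strip numbers
    text = re.sub(r"[^\w\s]", " ", text)  # strip special characters
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_text("Ciao, come stai? 123!"))  # ciao come stai
```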
- Encoding Labels: convert categorical language labels into numerical values using `LabelEncoder`.
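For illustration, `LabelEncoder` maps each language name to an integer index (the labels below are made up):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels = ["English", "Italian", "French", "Italian"]
y = le.fit_transform(labels)

# le.classes_ stores the sorted label names; y holds their integer indices
print(list(le.classes_))  # ['English', 'French', 'Italian']
print(y.tolist())         # [0, 2, 1, 2]
```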
- Train-Test Split: split the dataset into 80% training and 20% testing sets.
- Vectorization (Bag of Words): convert text into numeric vectors using `CountVectorizer`.
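`CountVectorizer` builds a vocabulary over the corpus and turns each document into a vector of word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
docs = ["ciao come stai", "hello how are you", "ciao ciao"]
X = cv.fit_transform(docs)  # sparse document-term count matrix

print(sorted(cv.vocabulary_))  # ['are', 'ciao', 'come', 'hello', 'how', 'stai', 'you']
print(X.shape)                 # (3, 7): 3 documents, 7 vocabulary terms
```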
- Model Training: train a Multinomial Naive Bayes classifier on the training data.
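A minimal training sketch on a toy corpus (the real script trains on the full vectorized dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["ciao come stai", "come va oggi",
               "hello how are you", "how is it going"]
train_labels = ["Italian", "Italian", "English", "English"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_texts)

model = MultinomialNB()  # works well with word-count features
model.fit(X_train, train_labels)

print(model.predict(cv.transform(["come stai"])))  # ['Italian']
```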
- Evaluation: compute Accuracy and F1 Score on the test data.
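Both metrics come from `sklearn.metrics` (the labels below are illustrative):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["it", "en", "en", "it", "en"]
y_pred = ["it", "en", "it", "it", "en"]

acc = accuracy_score(y_true, y_pred)
# "weighted" averages per-class F1 by class frequency, which suits
# imbalanced language datasets
f1 = f1_score(y_true, y_pred, average="weighted")
```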
- Pipeline:
  - Create a scikit-learn pipeline with `CountVectorizer` + Naive Bayes.
  - Save the pipeline using `pickle`.
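The pipeline and pickling steps can be sketched as follows (tiny inline training set for illustration):

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),  # text -> bag-of-words counts
    ("classifier", MultinomialNB()),    # counts -> language label
])
pipe.fit(["ciao come stai", "come va", "hello how are you", "how is it"],
         ["Italian", "Italian", "English", "English"])

# versioning the filename helps track retrained models
with open("trained_pipeline-0.1.0.pkl", "wb") as f:
    pickle.dump(pipe, f)

print(pipe.predict(["come stai"]))  # ['Italian']
```

Because the vectorizer is inside the pipeline, callers pass raw strings and never touch the vocabulary directly.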
- Prediction: load the pipeline and predict the language of new text (e.g., `"Ciao, come stai?"` → `"Italian"`).
Run the Python script:

```shell
python language_detection.py
```

Example prediction inside the script:

```python
text = "Ciao, come stai?"
y = pipe.predict([text])
print("Detected Language:", le.classes_[y[0]])
```

Output:

```
Detected Language: Italian
```
- Accuracy: the fraction of predictions that are correct.
- F1 Score: the harmonic mean of precision and recall (useful for imbalanced datasets).
The trained model pipeline is saved as `trained_pipeline-0.1.0.pkl`.
You can load it later and make predictions without retraining:

```python
import pickle

with open("trained_pipeline-0.1.0.pkl", "rb") as f:
    pipe = pickle.load(f)

print(pipe.predict(["Hello, how are you?"]))  # Output: English
```
The project includes a Dockerfile so it can be containerized.
- Build the Docker image.
- Run the container and map it to a port on your machine.
- Access the API endpoints and documentation through your browser or API client.
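For example (the image name and port mapping below are illustrative; adjust them to your setup):

```shell
# build the image from the project's Dockerfile
docker build -t language-detection .

# run the container, mapping container port 80 to local port 8000
docker run -d -p 8000:80 language-detection
```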
The application can be started with Uvicorn, making the API available on your local machine. It exposes two endpoints:

- Health Check → returns service status and model version.
- Prediction → accepts input text and returns the detected language.

Interactive documentation is automatically available through the FastAPI Swagger UI.
- Connect your GitHub repository to Render.
- Create a new Web Service and choose Docker environment.
- Render automatically builds the image using your Dockerfile.
- Set the start command (for FastAPI in Docker it’s handled by the base image).
- Expose port 80 (or the port your app is running on).
- Once deployed, your API is available at a public Render URL with full /docs support.
- Python 3.7+
- pandas
- numpy
- scikit-learn
- seaborn
- matplotlib
This project is licensed under the MIT License.