# Self-study notes: machine learning

## Preliminary understanding

### What is Machine Learning?

Machine learning (机器学习) is a subfield of artificial intelligence (AI) that focuses on developing algorithms and statistical models that enable computers to perform tasks without being explicitly programmed to do so. In simpler terms, it's about making computers learn from data so they can make decisions or predictions on their own.

### Where is it used?

1. **Natural Language Processing (自然语言处理)**: For tasks like language translation, sentiment analysis, and chatbots.
  
2. **Computer Vision (计算机视觉)**: For image recognition, object detection, and facial recognition.
  
3. **Healthcare (医疗保健)**: For diagnosing diseases based on medical images or data.
  
4. **Finance (金融)**: For stock prediction, fraud detection, and customer segmentation.
  
5. **Autonomous Vehicles (自动驾驶车辆)**: For navigation and obstacle avoidance.

6. **Recommendation Systems (推荐系统)**: For personalized content or product recommendations, like those used by Netflix or Amazon.

### Why is it important?

Machine learning matters because it lets us solve problems that are too complex to capture in hand-written rules, such as recognizing objects in images or understanding speech. It also automates decision-making, which can make systems more efficient and adaptive.

***

## Some concepts

### Basic concepts

1. **Data Sets (数据集)**: A collection of related data points or samples used for machine learning tasks. These are often divided into training, validation, and test sets.
   - **Training Set (训练集)**: Used to train the machine learning model.
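   - **Validation Set (验证集)**: Used to tune the model's hyperparameters and to compare candidate models during training. Because it guides these choices, its score tends to be an optimistic estimate of final performance.
   - **Test Set (测试集)**: Held out until training and tuning are complete, then used to give an unbiased estimate of the model's performance on unseen data.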

2. **Sample (样本)**: An individual data point in a dataset. A sample consists of one or more features that describe its properties.

3. **Feature (特征)**: An attribute or property of a sample, used as an input variable for making predictions. For example, in a dataset about houses, the features could include the number of rooms, square footage, etc.

4. **Feature Value (特征值)**: The actual value corresponding to a specific feature for a given sample. For example, if the feature is "number of rooms," the feature value might be 3 or 4.

5. **Sample Space (样本空间)**: The entire set of all possible outcomes or samples that could be observed or collected. In the context of machine learning, this would be the complete set of possible feature vectors.

6. **Feature Vector (特征向量)**: A list of feature values associated with a sample. For instance, if we are classifying fruits and we have features like color, weight, and diameter, a feature vector for an apple might be [red, 150g, 3 inches].
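
The terms above can be made concrete with a small sketch. The fruit data, the feature names, and the 60/20/20 split ratio below are all assumptions chosen for illustration, not values from any real dataset.

```python
import random

# Each feature is a named property; the order here defines the feature vector layout.
feature_names = ["color", "weight_g", "diameter_in"]

# A tiny hypothetical dataset: every row is one sample, stored as its feature vector.
dataset = [
    ["red", 150, 3.0],
    ["green", 170, 3.2],
    ["yellow", 120, 1.5],
    ["yellow", 130, 1.4],
    ["orange", 140, 2.9],
    ["orange", 160, 3.1],
    ["red", 155, 3.1],
    ["yellow", 125, 1.6],
    ["orange", 145, 3.0],
    ["green", 165, 3.3],
]

def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of the samples and cut it into train/validation/test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train_set, val_set, test_set = split_dataset(dataset)

# One sample and a single feature value taken from its feature vector.
sample = dataset[0]
feature_value = sample[feature_names.index("weight_g")]
```

Here `sample` is the feature vector `["red", 150, 3.0]` and `feature_value` is `150`. Shuffling before splitting is a common precaution so that each subset reflects the whole dataset rather than the order in which the data was collected.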

### Advanced concepts

1. **Model (模型)**: In the context of machine learning, a model is a mathematical representation that captures patterns in the data. Once trained, the model can ***make predictions or decisions*** based on new input data. Different types of models include ***decision trees (决策树)***, ***neural networks (神经网络)***, and ***support vector machines (支持向量机)***, among others.

2. **Label (标签)**: In supervised learning, each sample in the dataset is associated with a label, which is the "truth" or the outcome that the model aims to predict. For example, in a spam detection model, the labels could be "spam" or "not spam."

3. **Classification & Regression (分类与回归)**:
   - **Classification (分类)**: A type of ***supervised learning*** where the goal is to categorize samples into one of two or more classes. For example, determining whether an email is spam or not is a classification problem. ***(Discrete)***
   - **Regression (回归)**: Another type of ***supervised learning***, but here the goal is to predict a continuous numerical value. For example, predicting the price of a house based on its features is a regression problem. ***(Continuous)***

4. **Clustering (聚类)**: This is a type of ***unsupervised learning*** where the objective is to group similar samples together based on their features. Unlike classification, there are no pre-defined labels in clustering. For example, customer segmentation in marketing can be achieved using clustering.
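
To see how the three task types differ in practice, here is a minimal sketch in plain Python. The data is made up, and the methods (1-nearest-neighbour, a least-squares line, and a two-cluster k-means on one feature) are simple stand-ins picked for brevity; real projects would normally use a library and richer models.

```python
# --- Classification: predict a discrete label (1-nearest-neighbour) ---
# Hypothetical labeled samples: ([number of links, number of "!"], label)
labeled_emails = [
    ([0, 0], "not spam"),
    ([1, 0], "not spam"),
    ([5, 7], "spam"),
    ([6, 4], "spam"),
]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(features):
    """Return the label of the closest labeled sample."""
    _, label = min(labeled_emails, key=lambda pair: squared_distance(pair[0], features))
    return label

# --- Regression: predict a continuous value (least-squares line) ---
# Hypothetical houses: square footage -> price in thousands
areas = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [200.0, 300.0, 400.0, 500.0]

def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 1800.0 + intercept  # a continuous number, about 360

# --- Clustering: group unlabeled samples (k-means with k = 2, one feature) ---
# Hypothetical monthly spend per customer; note there are no labels here.
spend = [10.0, 12.0, 11.0, 95.0, 100.0, 105.0]

def two_means(values, iterations=10):
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

clusters = two_means(spend)
```

`classify([4, 6])` returns `"spam"` (a discrete class), `predicted_price` is a number on a continuous scale, and `clusters` splits the customers into a low-spend and a high-spend group without ever being told which is which.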

### Algorithm classification


1. **Supervised Learning (监督学习)**:
   - **What**: Here, ***you have labeled data***, meaning you know the "answer" for each piece of data. ***You train*** the model to learn from this data.
   - **Example**: Think of it like learning to cook from a recipe book with photos. The ingredients are the inputs (the features), and the photo of the finished dish tells you what the result should look like (the label).
   - **Common Uses**: Spam detection, image classification, and customer churn prediction.
  
2. **Unsupervised Learning (无监督学习)**:
   - **What**: In this type, ***you don't have labeled data***. The model tries to find patterns or groupings in the data ***on its own***.
   - **Example**: Imagine sorting a pile of different fruits without knowing their names. You'd likely group them by color, size, or shape.
   - **Common Uses**: Customer segmentation, anomaly detection, and natural language topic extraction.

3. **Reinforcement Learning (强化学习)**:
   - **What**: This is more like learning ***by trial and error***. The model (often called an "agent") takes actions and receives rewards or penalties. It learns to make better decisions over time.
   - **Example**: Think of training a dog. If it sits on command, you give it a treat (reward). If it doesn't, there's no treat (penalty).
   - **Common Uses**: Game playing (like chess or Go), robotics, and certain types of optimization problems.
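
Supervised and unsupervised learning both start from a fixed dataset, whereas reinforcement learning generates its own experience. The dog-training example can be sketched as a tiny trial-and-error loop. The reward table, the epsilon-greedy strategy, and all names below are assumptions for illustration; this is one simple approach among many, not a full reinforcement learning algorithm.

```python
import random

# Hypothetical environment: the "dog" gets a treat (reward 1) only when it sits.
REWARDS = {"ignore": 0.0, "sit": 1.0}

def train_agent(steps=100, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    actions = list(REWARDS)
    value = {a: 0.0 for a in actions}  # estimated reward for each action
    count = {a: 0 for a in actions}    # how often each action has been tried

    for step in range(steps):
        if step < len(actions):
            action = actions[step]                # try every action once first
        elif rng.random() < epsilon:
            action = rng.choice(actions)          # explore: pick a random action
        else:
            action = max(actions, key=value.get)  # exploit: pick the best so far
        reward = REWARDS[action]
        count[action] += 1
        # Running average of the rewards observed for this action.
        value[action] += (reward - value[action]) / count[action]
    return value, count

value, count = train_agent()
best_action = max(value, key=value.get)
```

After training, `best_action` is `"sit"`: the agent was never told the right answer, it only observed which action earned the reward. The small `epsilon` keeps a little exploration going so the agent does not lock onto an early guess.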

***

## Six steps for building a machine learning model

1. **Define the Problem (定义问题)**: Clearly state what you are trying to solve. Identify the type of problem (e.g., classification, regression) and the target variable or label.

2. **Data Understanding (数据理解)**: Examine the dataset to get a sense of its structure, features, and any potential issues that might need addressing.

3. **Data Preparation (数据准备)**: Clean and preprocess the data. This might involve dealing with missing values, encoding categorical variables, or scaling features.

4. **Evaluate Algorithms (评估算法)**: Choose machine learning algorithms that are appropriate for your problem. Train them on the training set and compare their performance on the validation set using metrics such as accuracy for classification or mean squared error for regression.

5. **Optimize Model (优化模型)**: Fine-tune the model's hyperparameters or try different algorithms to improve performance. This can also involve feature engineering to enhance the model.

6. **Deploy Results (结果部署)**: Once you have a well-performing model, integrate it into a production environment where it can start taking in new data and making predictions or decisions.
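
The six steps can be walked through end to end on a toy problem. Everything in this sketch is an assumption made for illustration: the spam data is invented, missing values are simply dropped, and the candidate algorithms are a majority-class baseline and k-nearest neighbours written by hand.

```python
from collections import Counter

# 1. Define the problem: binary classification, label 1 = spam, 0 = not spam.
#    Hypothetical data: (number of links in the email, label); None = missing value.
train_raw = [(0, 0), (7, 1), (1, 0), (None, 1), (9, 1),
             (2, 0), (6, 0), (8, 1), (3, 0), (11, 1)]
valid_raw = [(6, 1), (1, 0), (10, 1), (None, 0), (4, 0)]
test_raw = [(2, 0), (8, 1), (5, 1), (12, 1)]

# 2. Data understanding: check missing values and class balance.
n_missing = sum(x is None for x, _ in train_raw + valid_raw + test_raw)
class_counts = Counter(y for _, y in train_raw)

# 3. Data preparation: drop rows with missing values, then min-max scale
#    using statistics computed from the training set only.
def drop_missing(rows):
    return [(x, y) for x, y in rows if x is not None]

train_clean = drop_missing(train_raw)
low = min(x for x, _ in train_clean)
high = max(x for x, _ in train_clean)

def scale(x):
    return (x - low) / (high - low)

def prepare(rows):
    return [(scale(x), y) for x, y in drop_missing(rows)]

train, valid, test = prepare(train_raw), prepare(valid_raw), prepare(test_raw)

# 4. Evaluate algorithms: a majority-class baseline versus k-nearest neighbours.
def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def majority_baseline(rows):
    majority = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: majority

def knn(rows, k):
    def predict(x):
        nearest = sorted(rows, key=lambda row: abs(row[0] - x))[:k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]
    return predict

baseline_acc = accuracy(majority_baseline(train), valid)

# 5. Optimize the model: tune the hyperparameter k on the validation set.
val_acc = {k: accuracy(knn(train, k), valid) for k in (1, 3, 5)}
best_k = max(val_acc, key=val_acc.get)
final_model = knn(train, best_k)
test_acc = accuracy(final_model, test)  # measured once, after tuning is finished

# 6. Deploy: wrap preprocessing and the model behind a single function.
def predict_spam(num_links):
    return final_model(scale(num_links)) == 1
```

On this toy data the baseline reaches 0.5 validation accuracy, `k = 1` reaches 0.75 because it copies one noisy training sample, and `k = 3` reaches 1.0, so `best_k` is 3. The test accuracy of 0.75 is lower than the validation score, which is a reminder that the validation set is used for tuning and the test set is what estimates performance on new data.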